I’ve got some code running for the WatchMeCode media service that pulls a list of files from my AWS S3 bucket, and populates a database. The code has been working fine for a while now, but I recently wanted to add a sub-folder… and my code that reads the file list suddenly didn’t find the folders and files that I want.
S3:ListObjects Pages By Default. And Always.
The problem, as it turns out, is that S3 will always page your results for you – even when you don’t want it to. At least, that is my experience, based on having tried every way I could find to get Amazon not to page my results.
The code in question uses the aws-sdk for Node, and is fairly simple:
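Something along these lines – sketched here with a stand-in client so the example runs without AWS credentials, and with placeholder bucket and file names rather than my real ones:

```javascript
// Stand-in for the aws-sdk S3 client so this sketch is self-contained;
// with the real SDK this would be: var s3 = new AWS.S3();
// The bucket and file names are placeholders, not my real ones.
var fakeS3 = {
  listObjects: function(params, callback){
    // mimic the shape of a real listObjects response
    callback(null, {
      IsTruncated: true, // S3 sets this when more results are available
      Contents: [
        { Key: "episode-001.mp4", Size: 1024 },
        { Key: "episode-002.mp4", Size: 2048 }
      ]
    });
  }
};

var seenKeys = [];

fakeS3.listObjects({ Bucket: "my-media-bucket" }, function(err, data){
  if (err) { throw err; }

  // data.Contents holds one page of file descriptions
  data.Contents.forEach(function(file){
    seenKeys.push(file.Key);
  });
});

console.log(seenKeys);
```

The real code does the same kind of thing with each page of `data.Contents`, writing the results to the database.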
In the response from the call, I get a meta-data structure that includes an “IsTruncated” attribute. When the bucket holds a lot of files, this attribute is always set to true. That is, S3 will always page your result set (probably to keep the network traffic and memory use reasonable).
This means Amazon is paging the result set for me and I need to continue making calls back to the listObjects method, providing a “Marker” from which to start.
Paging With Markers
If you look at the docs for the aws-sdk, there is a “Marker” parameter we can pass to the listObjects method call. This parameter should either be empty (meaning, start at the beginning of the list), or set to a file Key to specify where the listObjects call should pick up.
The only problem is, the docs don’t tell you this in the definition of “Marker”. Instead, you have to scroll down to the response definition and look for the “NextMarker” attribute definition, which says this:
If the response does not include the NextMarker and it is truncated, you can use the value of the last Key in the response as the marker in the subsequent request to get the next set of object keys.
That being the case, once the listObjects method executes the callback, you will want to check the “IsTruncated” attribute and make yet another call to AWS with the last Key from the result as the new Marker:
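A sketch of that recursive paging is below. Again, the S3 client here is a local stub so the example runs standalone (with the real SDK, `s3` would be `new AWS.S3()`), and the key names are made up:

```javascript
// Stubbed S3 client that pages two keys at a time, mimicking the
// aws-sdk listObjects paging behavior. With the real SDK, this
// would be: var s3 = new AWS.S3();
var allKeys = ["a.mp4", "b.mp4", "c.mp4", "d.mp4", "e.mp4"];

var s3 = {
  listObjects: function(params, callback){
    var start = 0;
    if (params.Marker){
      // start just after the Marker key, as S3 does
      start = allKeys.indexOf(params.Marker) + 1;
    }
    var page = allKeys.slice(start, start + 2);
    callback(null, {
      Contents: page.map(function(key){ return { Key: key }; }),
      IsTruncated: (start + 2) < allKeys.length
    });
  }
};

// Recursively list every object, following the Marker while IsTruncated
function listAllObjects(params, soFar, done){
  s3.listObjects(params, function(err, data){
    if (err) { return done(err); }

    var keys = soFar.concat(data.Contents.map(function(f){ return f.Key; }));

    if (data.IsTruncated){
      // use the last Key from this page as the Marker for the next call
      var lastKey = data.Contents[data.Contents.length - 1].Key;
      return listAllObjects({ Bucket: params.Bucket, Marker: lastKey }, keys, done);
    }

    done(null, keys);
  });
}

var collected;
listAllObjects({ Bucket: "my-media-bucket" }, [], function(err, keys){
  if (err) { throw err; }
  collected = keys;
});
```

The important part is the check on `IsTruncated` and the re-call with the last Key as the new `Marker`; everything else is just accumulating results.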
Once you have this in place, you are effectively paging through the list of objects in your S3 bucket.
This particular problem may not show itself when you first start using S3. I’ve been using S3 for several years now, and only just ran into this issue, in fact.
But here’s the real kicker: I hit this problem because I have logging turned on in my S3 bucket, and S3 is returning thousands upon thousands of log files from my listObjects call.
I ran into this because I didn’t organize my files correctly from the start. I should have used sub-folders for everything instead of putting the core files in the root of the bucket.
So, now I get to go fix that and migrate my files, so I can filter the logs out of the listObjects call using the “Prefix” parameter.
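The effect I’m after looks something like this – sketched with a stubbed client and hypothetical folder names, since my migration isn’t done yet:

```javascript
// Stand-in client showing the effect of the Prefix parameter, without
// hitting AWS. The folder layout here is hypothetical.
var allKeys = [
  "logs/2015-01-01-00-00-00-ABC123",
  "episodes/episode-001.mp4",
  "episodes/episode-002.mp4"
];

var s3 = {
  listObjects: function(params, callback){
    // only return keys that start with the requested Prefix,
    // which is how S3 filters the listing
    var matches = allKeys.filter(function(key){
      return key.indexOf(params.Prefix || "") === 0;
    });
    callback(null, {
      Contents: matches.map(function(key){ return { Key: key }; }),
      IsTruncated: false
    });
  }
};

var episodeKeys;
s3.listObjects({ Bucket: "my-media-bucket", Prefix: "episodes/" }, function(err, data){
  if (err) { throw err; }
  episodeKeys = data.Contents.map(function(f){ return f.Key; });
});
```

With the media files under their own sub-folder, the log files never show up in the listing at all.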