Amazon S3 Batch Operations (amazon.com)
75 points by jeffbarr on April 30, 2019 | hide | past | favorite | 21 comments


Batch delete, batch delete, wherefore art thou batch delete?


It's not exactly what you're asking for, but we have a large bucket with billions of files (don't ever do this, it was a terrible idea) and we manage deletions via lifecycle rules. If your file naming convention and data retention policy permit it, it's far easier than calling delete with 1,000 keys at a time.
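For anyone curious, a lifecycle expiration rule for this is just a bucket-level config. A minimal sketch with boto3 (the bucket name, prefix, and retention period are made up for illustration):

```python
# Sketch: expire objects under a prefix after 30 days via a lifecycle rule.
# Bucket name, prefix, and retention period are hypothetical.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-old-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}

# Applied with boto3 (not run here):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
```

S3 then deletes matching objects server-side, with no per-key delete calls and no request charges for the expirations.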

Also, just a word of warning: if you do have a lot of files and you're thinking "let's transition them to Glacier", don't. The transition cost from S3 -> Glacier is absolutely insane ($0.05 per 1,000 objects). I managed to generate $11k worth of charges doing a "small" test of 218M files and a lifecycle policy. Only use Glacier for large individual files.

[1] https://docs.aws.amazon.com/AmazonS3/latest/user-guide/creat...
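The arithmetic behind that bill checks out, using the $0.05 per 1,000 transition requests figure above:

```python
# Sanity check of the Glacier transition bill mentioned above.
transition_cost_per_1000 = 0.05   # USD per 1,000 S3 -> Glacier transitions
objects = 218_000_000             # the "small" test

cost = objects / 1000 * transition_cost_per_1000
print(f"${cost:,.0f}")  # roughly the $11k mentioned
```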


I have to ask: what’s performance like for operations on the bucket objects?

Edit: I ask because AWS suggests a key naming convention when you have a large number of objects, to ensure that you're distributing your objects across storage nodes and to prevent bottlenecks.

https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate...


“This S3 request rate performance increase removes any previous guidance to randomize object prefixes to achieve faster performance. That means you can now use logical or sequential naming patterns in S3 object naming without any performance implications.”

https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3...


No difference for Put, Get, and Delete. I don't know about List, but if it degrades, it's not significant. I've worked with buckets with exabytes of data and billions of objects.


Never noticed any speed difference due to bucket size. S3 is generally slow anyway (250ms for a write isn't uncommon) but it scales very well and we use it for raw data storage that's not in our critical path, so the latency isn't a problem.

Edit Response: I've always used the partitioning conventions they suggest, so I'm not sure what sort of impact you'd encounter without them.


But would a batch delete by any other name smell just as sweet?


> don't ever do this, it was a terrible idea

I have billions of files. What do?


For us, it was due to the relatively high PUT cost if you're storing a large number of small files. We ended up changing our approach and we now store blocks (~10MB archives) instead of individual files into S3. The S3 portion of our AWS bill was previously 50% PUT / 50% long-term storage charges. After the change, we managed to reduce the PUT aspect to nearly $0 and reduced our overall AWS bill by almost 30%, while still storing the same amount of data per month.

e.g. if you write 1 million 10KB files per day to S3, you're looking at $150/mo in PUT costs. If you instead write 1,000 10MB blocks, you're looking at $0.15/mo in PUT costs.
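Those figures line up with S3's standard request pricing ($0.005 per 1,000 PUTs), assuming a 30-day month:

```python
# Sanity check of the small-files-vs-blocks PUT cost comparison above.
put_cost_per_1000 = 0.005  # USD per 1,000 PUT requests (S3 Standard)
days = 30

small_files = 1_000_000 * days / 1000 * put_cost_per_1000  # 1M 10KB files/day
blocks = 1_000 * days / 1000 * put_cost_per_1000           # 1k 10MB blocks/day

print(f"small files: ${small_files:.2f}/mo, blocks: ${blocks:.2f}/mo")
```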

Due to S3's support of HTTP range requests, we can still request individual files without an intermediate layer (though our write layer did slightly increase in complexity) and our GET (and storage) costs are identical.
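Fetching one file out of a block just needs its offset and length from some index. The index and the boto3 call below are illustrative; the concrete part is the Range header, whose end offset is inclusive:

```python
def byte_range(offset, length):
    """HTTP Range header value for `length` bytes starting at `offset`.
    The end offset is inclusive per the HTTP range spec."""
    return f"bytes={offset}-{offset + length - 1}"

# With boto3 (not run here), against a hypothetical bucket/key/index:
# import boto3
# s3 = boto3.client("s3")
# offset, length = index_lookup("some-file")  # hypothetical index lookup
# resp = s3.get_object(Bucket="my-bucket", Key="blocks/000123.bin",
#                      Range=byte_range(offset, length))
# data = resp["Body"].read()
```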


Wherefore means why, kind of like "what for". You want just "where", though that breaks the quote.


It's plausible that zacharyozer refers to the batch delete that exists in his or her heart. In which case "wherefore" would be appropriate.



This has a limit of 1,000 keys, doesn't handle redriving failed requests, and offers no report of the job.

What you could do is use s3's inventory report feature, give the manifest generated to batch operations and handle the delete logic in a lambda. A lifecycle policy with some tagging could also fit your needs here.
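A sketch of what that Lambda delete logic could look like, following the Batch Operations invocation schema. The bucket-ARN parsing is simplified, and the actual delete call is injected so this stays a standalone sketch rather than a production handler:

```python
import urllib.parse

def handler(event, context, delete_object=None):
    """S3 Batch Operations Lambda handler sketch: delete each object in
    the manifest. `delete_object` is injected for testing; in real use it
    would be boto3's s3.delete_object."""
    results = []
    for task in event["tasks"]:
        # Bucket ARNs look like arn:aws:s3:::bucket-name
        bucket = task["s3BucketArn"].split(":")[-1]
        # Keys in the manifest are URL-encoded
        key = urllib.parse.unquote_plus(task["s3Key"])
        try:
            if delete_object is not None:
                delete_object(Bucket=bucket, Key=key)
            results.append({"taskId": task["taskId"],
                            "resultCode": "Succeeded",
                            "resultString": key})
        except Exception as exc:
            results.append({"taskId": task["taskId"],
                            "resultCode": "TemporaryFailure",
                            "resultString": str(exc)})
    return {"invocationSchemaVersion": event["invocationSchemaVersion"],
            "treatMissingKeysAs": "PermanentFailure",
            "invocationId": event["invocationId"],
            "results": results}
```

Batch Operations then handles the retries, completion report, and progress tracking that the plain 1,000-key delete API lacks.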


It's symptomatic of that business that cost isn't mentioned anywhere in the announcement. I'm getting more and more skeptical of everything serverless because the cost is really difficult to estimate, plan, and manage. Of course, used right, some of these services can be cost-efficient. But in real life, not all software is done right...

If you buy a server and run a poorly architected system on it, you notice that it doesn't perform and make changes.

If you use serverless and run a poorly architected system on it, you pay, and then you make changes (after someone notices the bill). Yes, there are cost reports, but they aren't easy to use or understand. With a performance bottleneck, the system limits you while you study the performance measurements; in the cloud case, you're paying while trying to understand what's wrong.

Of course, in a big corporation, money doesn't matter to a software developer. But in a small company, the bill paid to the cloud provider might have a direct impact on whether the company can pay your salary in the near future.


>> If you buy a server and run a poorly architected system on it, you notice that it doesn't perform and make changes.

The examples you provide aren't equivalent. It's more like "we have poorly architected software, so we had to buy 200 dedicated servers, because we didn't know how to / couldn't make it work on 10 of them".

In the cloud you could simply update your software and then downscale. Of course you pay more for flexibility, but please stop with those straw men.


“Invoking AWS Lambda Functions ... I can invoke a Lambda function for each object, and that Lambda function can programmatically analyze and manipulate each object. ”

Wow thanks!


Oh man, that's what Joyent's Manta could do a while ago. It's nice to have this in AWS now.


Or, you know, JUST map for map-reduce/hadoop.


A few months back, I designed a small background system requiring a flat key/value store for tracking large amounts of data (>10GB/day). I was hoping to use S3 as a cheap key/value store, but the lack of batch operations meant individual puts, which made it performance-prohibitive, so I went with DynamoDB. It's worked out great, but I'll always wonder what could have been with S3 if I'd had batch operations back then.
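For reference, DynamoDB's BatchWriteItem caps each request at 25 items. A sketch of the batching (boto3's batch_writer, shown commented out with a hypothetical table name, does this chunking and unprocessed-item retrying for you):

```python
def chunk(items, size=25):
    """Split a list into batches; DynamoDB's BatchWriteItem accepts at
    most 25 items per request."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# With boto3 (not run here):
# import boto3
# table = boto3.resource("dynamodb").Table("my-kv-store")  # hypothetical
# with table.batch_writer() as batch:
#     for item in items:
#         batch.put_item(Item=item)
```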


This is a different type of “batch”.

You’re talking about API level batch calls. This is about simplifying workflows which rely on Listing every object in S3 and doing “something”.


I was hoping to use this for moving large video files to a different prefix, but just spotted a limitation of the PUT Object Copy - "Objects to be copied can be up to 5 GB in size."

Cool feature otherwise.



