I don't think it's so clear cut. They have to pay to compress it. If the data the customer stores is short lived it may not be worth it to them. They don't know if the customer already compressed it so they might be wasting their CPU. They also have to pay to decompress it on every access. They allow you to slice an arbitrary byte range out of an object which is technically harder to implement on a compressed file. They charge by the GB and are not exactly super cheap so if the customer wants to store a big fat file of easily compressible zeros then whatever, they got their money.
It might make more sense on their "deep archive" product, where the customer has to commit to a minimum retention period and also pays a retrieval charge that scales with the amount of data recovered (hence paying for the CPU to decompress).
Moreover, zstd's quite unusual '--adapt' option enables it to "dynamically adapt compression level to perceived I/O conditions". Works for me (albeit the manpage states that "it can remain stuck at low speed when combined with multiple worker threads").
It reacts to the input buffer state (in bad I/O conditions it starves).
On Linux PSI ( /proc/pressure/io ) probably provides more accurate information (the code already uses /proc/cpuinfo ).
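For illustration, the PSI numbers are easy to consume; a minimal sketch (hypothetical helper, zstd does not actually read PSI today) that pulls the avg10 field out of a PSI-format line such as the "some" line of /proc/pressure/io:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse the avg10 field from a PSI-style line, e.g. the "some" line of
 * /proc/pressure/io. Hypothetical helper -- zstd does not read PSI
 * today. Returns the value, or -1.0 if the field is missing. */
static double psi_avg10(const char *line)
{
    const char *p = strstr(line, "avg10=");
    double v;
    if (p == NULL || sscanf(p + 6, "%lf", &v) != 1) return -1.0;
    return v;
}
```

In a real tool you would read the line from /proc/pressure/io first and treat a high avg10 as "I/O is the bottleneck, spend more CPU on compression".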
Detail: in the fileio.c module there are lines such as:
if (oldIPos == inBuff.pos) inputBlocked++;   /* input buffer is full and can't take any more : input speed is faster than consumption rate */
if ( (inputBlocked > inputPresented / 8) ... )   /* input is waiting often, because input buffer is full : compression or output too slow */
This impacts a 'speedChange' variable.
Its potential values (an enum) are 'noChange', 'slower', and 'faster'.
They are processed rather simply:
if (speedChange == slower) {
    (...)
    compressionLevel ++;
}
if (speedChange == faster) {
    (...)
    compressionLevel --;
}
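Stitched together, the feedback loop amounts to roughly the following (a simplified, self-contained sketch; the real logic in fileio.c uses more signals than this):

```c
#include <assert.h>

typedef enum { noChange, slower, faster } speedChange_e;

/* Simplified --adapt decision: if the input buffer is blocked too
 * often, compression is the bottleneck, so drop a level (go faster);
 * if compression is starving for input, there is headroom, so raise a
 * level (go slower but compress better). Sketch only, not zstd code. */
static int adaptLevel(int compressionLevel,
                      unsigned inputBlocked, unsigned inputPresented,
                      int compressionStarved)
{
    speedChange_e speedChange = noChange;
    if (inputBlocked > inputPresented / 8) speedChange = faster;
    else if (compressionStarved)           speedChange = slower;

    if (speedChange == slower) compressionLevel++;
    if (speedChange == faster) compressionLevel--;
    return compressionLevel;
}
```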
I think it is clear cut, mostly because I do not think compression compromises any of those features, all while making the user experience better.
For any storage system like this you usually have a few bottlenecks. IO and Network are the obvious ones, followed by tiering (cache, fast io, slow io, ...) and at the very end CPU.
Now let's say network is your bottleneck. If you can send the data to the client in compressed form then you get the compression ratio as additional bandwidth, and the user gets the data quicker! So compression on the network path is a clear win.
But the common bottleneck is often IO: a high-end SSD with 1M IOPS at 4KB would _theoretically_ serve 4GB/s, roughly a 40GBit link. That's without any redundancy or other overhead.
Again compression to the storage layer would decrease the total amount of IOs, thus making sure a customer gets data quicker.
Ok, let's say both are not the issue. The fastest compression algorithms compete with memcpy. So if you need just one copy of your data, you might well end up faster by compressing it.
Especially fast compression algorithms (zstd, lz4, snappy, lzo, ...) are worth the CPU cost with virtually no downsides. The problem is finding the right sweet spot that reduces the current bottleneck without creating a CPU bottleneck, but zstd offers the greatest flexibility there, too.
Oh for range requests.... Those large objects are likely split anyway, for easier error recovery (imagine 100MB into a 1GB transfer you notice that the file data was corrupted - not good). Once you work on blocks it's easy to do somewhat efficient range requests again.
What I've personally implemented is trial compression with heuristics (you eagerly compress chunks, and if enough chunks don't compress, stop trying). It does require low level input/output control and per-chunk/block compression.
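The give-up heuristic can be sketched in a few lines (an illustration of the idea, not the actual implementation; the chunk sizes, the "must shrink to roughly 7/8" threshold, and the failure streak limit are all made-up parameters):

```c
#include <assert.h>
#include <stddef.h>

/* Trial compression with a give-up heuristic (sketch). A chunk counts
 * as compressible if it shrank to roughly 7/8 of its original size or
 * less; after `maxFails` consecutive incompressible chunks we stop
 * spending CPU on trial compression entirely. Returns how many chunks
 * were trial-compressed before giving up. */
static size_t trialCompress(const size_t origSize[], const size_t compSize[],
                            size_t nChunks, size_t maxFails)
{
    size_t fails = 0, tried = 0;
    for (size_t i = 0; i < nChunks; i++) {
        tried++;
        if (compSize[i] <= origSize[i] - origSize[i] / 8) {
            fails = 0;              /* chunk compressed well: keep trying */
        } else if (++fails >= maxFails) {
            break;                  /* too many duds in a row: give up */
        }
    }
    return tried;
}
```

The streak counter is the key design choice: one incompressible chunk (e.g. an embedded JPEG) shouldn't disable compression for the rest of a mostly compressible stream.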
That said, a surprising number of video sources use sparse file type setups, and I've gotten pretty good compression (up to 60%) using LZ4 with NVR files from some brands.
This is not how such storage systems work. If the whole storage system is for you and you only store compressed video files on it then perhaps.
But we are talking about multi-tenant systems here. You will get better performance and/or better prices if the system finds compressible data for other customers.
All the benefits hold, even for you with non-compressible data, as long as there is an overall benefit.
Hitting a bottleneck less often this way will reduce your tail latencies, too.
It is just that _your files_ won't contribute to these improvements. But you get all the benefits as well.
Having a big CPU (which you need to get many PCIe lanes) and then not using it is waste.
It would be waste to _not_ use the CPU time.
Granted, the CPU power consumption would drop with power management enabled, but those are only marginal gains, as the CPU is likely busy anyway and the total duration of busy time might drop (race to sleep).
I have used this inverted logic to great effect. E.g. when expanding data center capacity I was usually swapping old hardware for new hardware once it arrived. It is way better to have the older generations as spares and use the new capacity at the most utilized places.
Video files already have entropy coding applied to them, and thus any compression gains from reapplying a generic entropy coder like zstd are unlikely.
Hardware compression IS available in Graviton 2:
"1Tbit/s of compression accelerators
• 2xlarge and larger instances will have a compression device
• DPDK and Linux kernel drivers will be available ahead of GA
• Data compression at up to 15GB/s and decompression at up to 11GB/s"
How much would someone pay for this? I have a half-written zstd core, but I doubt the market for $100-150 FPGA-based compression accelerators is all that large.
Probably not worth it as a FPGA solution or even in general as an add-on card (the overhead of dealing with such extra hardware means that the threshold for "worth it" is very high).
I would expect this to become part of newer generations of CPUs once it becomes popular.
I'd never heard of this! The full name is QuickAssist and it does encryption too. They advertise 100Gb/s symmetric crypto, 70Gb/s compression (or roughly 100x faster than ZStandard on a single CPU). Seems to retail for about $650 for a card.
QAT is actually likely to end up inside new server CPUs from Intel - at least according to the advertising material. Also, it is in their new SmartNICs. At least somebody is using it.
It has been done before, if you offload gzip you turn your PCI bus into a choke point. Most of the time you do better keeping it on the main processor.
I think a lot of datacenter SSDs already come with in-drive hardware compression these days, since it not only increases speed but also longevity. So it would actually save money anyway.
It's not great, but it's also not unheard of. Tape capacities are often quoted at double the actual storage space, with fine print that says "assuming 50% compression". Also, if compression makes IOPs faster or reduces wear on SSDs, people might not complain so much.
SSDs could still do it for speed and to write less pages, making the drive last longer, and simply report the uncompressed space as used. They already do all sorts of tricks on pages such as moving them logically, having more internally than they report to use as pages wear out, and so on.
Given that, it's good sense to compress if at all possible simply to make the drive live longer.
And guess what - I just googled, tons of hits, and this has been done for a long time :)
So it makes sense, is done, and is important for modern SSD behavior.
I'm surprised (and shocked) that letting unencrypted data hit the disk is still common enough to make such optimizations worth it.
Even if you just stick the key in the server's TPM without any sealing, an encrypted disk makes it much easier to deal with e.g. drive returns (for warranty or fault analysis) or disposal.
I explained why I think it does provide a benefit even in a datacenter:
"Even if you just stick the key in the server's TPM without any sealing, an encrypted disk makes it much easier to deal with e.g. drive returns (for warranty or fault analysis) or disposal."
Would you resell usable drives that you no longer want to use (e.g. because they're too small or your needs changed more towards SSDs) if unencrypted data was written to them at some point?
How much effort would you put into making sure that no broken drive that can't be wiped leaves the datacenter without shredding?
Any process you put in place will have gaps - e.g. through human error or malicious acts - and this provides pretty solid protection against that.
Unfortunately that's going to really depend. For example, if your threat model is "hard drive gets stolen" there's no point. If your threat model is "attacker can access my database" encrypting the data at the DB level does make sense. But it obviously breaks compression.
And unfortunately compression and encryption are seemingly at odds fundamentally :c
Yes they do. Intel writes this for example (not all drives use it, but some definitely do):
>Data compression via encoding algorithms enables a solid state drive (SSD) to write less data, which in turn yields higher write bandwidth. With a significant amount of data being compressible, performance benefits can be substantial.
What do you mean? Even without compression, you have half the problem. Obviously, at scale, it’s all just statistics and planning on trends instead of individual files.
> They charge by the GB and are not exactly super cheap so if the customer wants to store a big fat file of easily compressible zeros then whatever, they got their money.
Maybe they charge at the uncompressed rate but store it compressed? Then they got even more money!
This was true a few years back, nowadays it’s cheaper and faster to compress the data at rest as the bottleneck is frequently IO and storage space. Both, in terms of capacity and cost.
Only temporarily with SSDs. With spinning rust, it also often paid off to compress data. We'd store large treebanks compressed, because decompression was much faster than disk reads.
Presumably it's the tradeoff of CPU overhead versus disk and bandwidth (larger files take longer to copy into memory, which also costs energy, and take more bandwidth to shunt around Amazon's own network).
It doesn't really have to impact performance. The index is generated easily as a side-effect of compression. And the index is only needed if you need to seek.
But if you do need to seek, which is really common in data warehouse workloads for example, then unless you keep the index in RAM you have to do an extra IO on every seek just to read the index.
there's always going to be some metadata for the file that needs to be looked up before you can start seeking (ACLs, sector/extent/cluster location, etc)
yeah that isn't free either, it adds significant bloat to your metadata. with most enterprise customers encrypting and/or compressing data before putting it into s3, it doesn't seem like there would be much benefit. s3 really isn't the right layer to implement compression. filesystems aren't either. it's better to leave it up to the application.
> yeah that isn't free either, it adds significant bloat to your metadata
yeah, 4 bytes for every megabyte
> s3 really isn't the right layer to implement compression. filesystems aren't either. it's better to leave it up to the application.
yeah, I'm sure you're right and Amazon have absolutely no idea what they're doing and like to spend unnecessary CPU cycles doing pointless work and add "significant bloat" to their metadata
... or, you're wrong (like in every previous comment in this chain)
this tweet is not talking about compressing customer data in s3, i seriously doubt that aws compresses customer data in s3 for all the reasons i've already listed. i am right and amazon does know what they're doing, which is why they don't compress customer data in s3.
4 bytes per megabyte becomes significant at scale when you have to keep it in ram, which you have to do if you want to avoid the extra IO.
You only need a single part to calculate a specific offset, assuming you have part sizes stored in metadata already (a good idea).
Each part can be max 5GiB as per S3 spec. 5120 * 4 = 20KiB.
Even if you unpack to 8*2 bytes in memory when decoding, you are still not talking a huge amount of memory.
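The offset arithmetic is straightforward; here is a sketch (a hypothetical helper for illustration, not MinIO's actual code) of locating the part that contains a given absolute byte offset, using the per-part sizes stored in metadata:

```c
#include <assert.h>
#include <stddef.h>

/* Locate the part containing absolute byte `off`, given per-part sizes
 * from object metadata (parts can be up to 5GiB each per the S3 spec).
 * Writes the offset within that part to *inPart and returns the part
 * index. Sketch only; assumes `off` is within the object. */
static size_t findPart(const unsigned long long partSize[], size_t nParts,
                       unsigned long long off, unsigned long long *inPart)
{
    size_t i = 0;
    while (i < nParts && off >= partSize[i]) {
        off -= partSize[i];     /* skip whole parts before the target */
        i++;
    }
    *inPart = off;
    return i;
}
```

Once you know the part, only that part's per-block index has to be consulted, which is what keeps the lookup cheap.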
The on-disk space is ~0.0004% as blibble calculated, and should easily be offset by the compression achieved. In MinIO we don't store indexes for files < 8MiB, so for small files there is no overhead.
If the added metadata is a problem for whatever system you are looking at, then that is a characteristic of that system and not a general problem.
if it's not in ram, you have to do an extra IO to look it up. i don't think you understand how precious metadata space is in a large scale storage system. if you pollute the metadata cache with useless junk like this, you can't cache as many things, your hit rate goes down, and you have to do more IO operations to service each request on average. name one popular distributed file system or object store that compresses everything by default like you are claiming. you won't be able to, because none of them do it, because it's better to leave it to the application.
> They charge by the GB and are not exactly super cheap so if the customer wants to store a big fat file of easily compressible zeros then whatever, they got their money.
That's not a good argument: they could lower their costs with compression, still charge the same, and make more profit.
Why do they have to either compress it all or not? They could be smart about it: split the files into pieces (just like some network file systems/backups do) and, if those blocks are untouched for a while, compress what's compressible and leave the rest as-is during active use.
I think the point of the different storage tiers of AWS S3 is to get customers to classify their own data, then AWS can pick the right mix of hardware, software, and compute that satisfies AWS’s requirements for availability and COGS.
If the difference between standard S3 and S3 Glacier was just slower disk, then rate limiting the customer would suffice.
But if there’s a significant amount of compute thrown at data de-duplication, compression, and indexing, then it starts to clarify why there’s a pricing penalty for using Glacier with the same access patterns as one would use on standard storage.
> They charge by the GB and are not exactly super cheap so if the customer wants to store a big fat file of easily compressible zeros then whatever, they got their money.
And what if they can charge you the uncompressed size and only actually store much smaller compressed files behind the scenes? That seems like having your cake and eating it too.
I think an FPGA can probably compress quite a bit faster than a general purpose CPU, and an ASIC on a card which is network on one side and storage on the other could compress at line speed easily.
If this is true, there is possibly a side channel one could run against object storage to determine if someone else in the content-addressable store has the same files.
Like when it was easy to file share on dropbox by having the correct hashes. A GUID could summon a 1GB file.
A couple of comments across the thread have made similar points, but if I were implementing this, the "client metadata" like the incoming sha256 etc would be implemented a layer higher than the actual byte storage, so the byte storage could be compressed without any impact on that sort of thing.
It's not quite that simple. If you have customer data that's already encrypted, then compression won't do much because it looks random. But of course by the time you get to your infrastructure layers, that'll be the case (or you really messed up your security story!). Which means you'd have to compress right at the edge. They might be doing that (which would basically mean it's the customer compressing it before they encrypt it with their keys because AWS has no business seeing the clear text), but then you get to compress each item separately, which might not be very effective for small values.
tl;dr: There's a real efficiency/security/insider-risk trade-off here.
Edit: I should disclose that I work for a competitor. Don't intend any astroturfing.
I agree with you, there could be scenarios where customers supply their own keys and compress the data on their own. My original statement is still true though, the data at rest ends up being compressed and encrypted.
That said, of course, customers can upload encrypted blobs of uncompressed data. But I’d call it an exception that proves the rule. Here service simplicity should win and those blobs may end up recompressed.
How would that work for something like S3 range requests? Rather than reading an entire object sequentially (which would work fine with transparent compression) you can also ask to read an arbitrary byte range (give me bytes 1,000,000,000-1,000,001,000 from the original file). I guess you could maybe store the compressed file in chunks with metadata about the original byte range inside each chunk.
For MinIO (an S3 compatible server), we add an index for each part, which contains uncompressed -> compressed offset pairs.
Since we already used a Snappy-derived method, each 1MB block is stored without backreferences. With this we only have to decode at most 1MB-1 extra bytes to respond with a specific range offset.
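The seek path for a range request can be sketched like this (a simplified illustration of the independent-blocks idea, not MinIO's actual layout or code):

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE ((size_t)1 << 20)   /* 1MiB independent blocks */

/* compOff[i] is where block i starts in the compressed stream; each
 * block holds BLOCK_SIZE uncompressed bytes and has no backreferences
 * to earlier blocks. To serve a range request starting at uncompressed
 * offset `want`, start decoding at the returned compressed offset and
 * throw away *discard bytes -- at most BLOCK_SIZE-1 of them. */
static size_t seekBlock(const size_t compOff[], size_t want,
                        size_t *discard)
{
    size_t block = want / BLOCK_SIZE;
    *discard = want % BLOCK_SIZE;   /* bytes to decode and discard */
    return compOff[block];
}
```

Storing blocks without backreferences costs a little compression ratio but bounds the extra work per seek, which is the trade-off being described here.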
Generally with filesystem-level compression you don't compress an entire multi-GB file: you compress segments of maybe a few 100k. This gives you a very slightly worse compression ratio but allows random seeks to still be efficient.
I'm sure it's encrypted, but I doubt that they compress everything. Images and video tend not to compress well, since they've typically already been aggressively compressed with specialized algorithms. It would just be throwing CPU cycles away.