Ribbon filter: Practically smaller than Bloom and Xor (fb.com)
216 points by boyter on July 11, 2021 | hide | past | favorite | 40 comments


So the high level idea as I understood it seems to be this:

Use a hash function b(x) to transform keys into fixed length bitvectors.

Now we want to build a function b'(x) such that b'(x) = b(x) for any keys in our set, and such that b'(x) is a random value for other keys. Checking if b' and b agree will be our membership test.

They construct b'(x) as h(x) · Z, where h(x) is a binary vector derived from the key and Z is a binary matrix.

By computing h(x) for every element we want to include, we get a set of row vectors that we can stack into a matrix H. Z can then be found by Gaussian elimination on the linear system H · Z = B, where the rows of B are the corresponding b(x) values.

The name ribbon filter comes from the fact that H is constructed in a particular way that gives it a ribbon shape. That shape makes the whole thing efficient in time and space.
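The whole construction can be sketched end to end over GF(2). This is a hedged toy, with my own naming throughout: it uses dense random coefficient rows where a real ribbon filter uses narrow banded rows (that band is what makes construction fast and gives the matrix its ribbon shape), and SHA-256 as a stand-in hash.

```python
import hashlib

M, R = 64, 8  # coefficient-vector width in bits, fingerprint bits


def h(key):
    """Coefficient row h(x): a dense 64-bit vector (a real ribbon
    filter would use a short band of bits at a hashed offset)."""
    d = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(d[:8], "big")


def b(key):
    """Fingerprint b(x): R pseudorandom bits per key."""
    return hashlib.sha256(key.encode()).digest()[8] & ((1 << R) - 1)


def build(keys):
    """Solve H . Z = B over GF(2) so h(x) . Z == b(x) for every key."""
    pivots = {}  # pivot column -> augmented row [coefficients | rhs]
    for k in keys:
        row = (h(k) << R) | b(k)
        while row >> R:
            p = (row >> R).bit_length() - 1  # leading coefficient bit
            if p not in pivots:
                pivots[p] = row
                break
            row ^= pivots[p]  # eliminate against existing pivot
        else:
            raise ValueError("singular system; rehash in practice")
    Z = [0] * M  # the solution: one R-bit row per coefficient column
    for p in sorted(pivots):  # back-substitute in increasing pivot order
        row = pivots[p]
        acc = row & ((1 << R) - 1)
        for q in range(p):
            if (row >> (R + q)) & 1:
                acc ^= Z[q]
        Z[p] = acc
    return Z


def query(Z, key):
    """Membership test: does h(x) . Z equal b(x)?"""
    row, acc = h(key), 0
    for i in range(M):
        if (row >> i) & 1:
            acc ^= Z[i]
    return acc == b(key)
```

Every inserted key satisfies h(x) · Z = b(x) exactly; any other key matches only by chance, with probability about 2^-R.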


Thank you for the summary, very clear and succinct! It's mentioned in a different comment here, but it might help to clarify at the beginning that (unlike bloom filters) the data structure is fixed (can't add more entries).


So it's offline then?


The application in RocksDB is that it is probably computed when generating an SST file, which is an immutable set of key-value pairs on disk. Therefore it can be saved into the SST file itself and loaded into memory when opening the DB, and then consulted when doing point lookups by key - does this SST file definitely not contain the key?


The benefits this data structure can give you sound kinda cool, even if I've got pretty much no idea how this data structure works based on the blog post. I've seen issues with fast set membership checks before that result from "Bloom filters having not the most pleasant API" - they can get annoying in practice.

As something related HN might find interesting: Ethereum uses Bloom filters internally within each block to allow users to easily watch for specific "events" that are fired off by transactions. So, for example, you can just look at the Bloom filter in a block and check if your address is in it, and this will tell you if certain types of tokens might have been transferred to your address within this block. As a user, this might allow you to not look at all the transactions in a block if you're just trying to watch for transactions relevant to you.

But the system is a bit broken. Because Bloom filters only do probabilistic checks of set membership (so they might report inclusion when nothing is included), and because Bloom filters use hashing internally, you can pretty easily fill up the entire Bloom filter and potentially minorly DOS a few clients. A friend and I have a write-up about our attack here [1]. We ended up deciding it wasn't high severity, but it was still pretty fun.

[1] https://medium.com/@naterush1997/eth-goes-bloom-filling-up-e...


This behaviour of the Bloom filters is intended and not a DOS vector for clients. Imagine they did not use Bloom filters: you would have to analyze every transaction to see if you are interested in it. The Bloom filter just allows you, in most cases, to be absolutely sure you are _not_ interested. You only need to analyze further if the filter says you may be interested.
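Those semantics can be made concrete with a minimal Bloom filter sketch (sizes, probe count, and the SHA-256-based probing here are illustrative choices, not Ethereum's actual log-bloom parameters):

```python
import hashlib

M, K = 1024, 3  # bits in the filter, probes per item (illustrative)


def probes(item):
    """K pseudorandom bit positions for an item (SHA-256 as a toy hash)."""
    for i in range(K):
        d = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(d[:4], "big") % M


def add(bits, item):
    """Set the item's K probe bits in the filter."""
    for p in probes(item):
        bits |= 1 << p
    return bits


def might_contain(bits, item):
    # All probed bits set  -> "maybe present" (can be a false positive).
    # Any probed bit clear -> "definitely not present" (never wrong).
    return all((bits >> p) & 1 for p in probes(item))
```

A clear probe bit is proof of absence, which is why clients can skip most blocks; a completely filled filter degrades to "maybe" for everything, which is what the attack above exploits.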


The fact that Bloom filters are so far from the information theoretical limit for the problem they address offers some hope.

I think it’s reasonable to expect that there are many structures out there that can achieve the same API in less space (time?) and that at least a few of those can achieve a better API in the same space, even if it means giving up some gains to store extra metadata.

Example: linked list hashmaps, which store two extra pointers to remember insertion order.


There's a paper linked at the bottom which would presumably help you learn how it works, if you are interested.

That being said, I haven't read it. Even parts of the blog post's implementation summary were over my head. That paper would surely be more of a "study" than a "read" for me.


Link to the full paper on arxiv: https://arxiv.org/abs/2103.02515


Thank you and thank you again for linking to the abstract rather than the pdf


This is an interesting structure. It’s important to note that it’s an immutable structure so it’s more comparable to xor filters rather than bloom in that aspect.

I’ve been looking for a structure that is simple but more efficient than Bloom, but many of the alternatives are quite complex (so hard to reimplement/port) or less performant. Still quite interesting. Will need to read this more closely.


Immutable here meaning that all keys need to be known at construction time? Certainly that's been one of the nice features of our use of Bloom filters - we can safely add keys over time on demand, sharding every hour and giving all callers an up-to-the-second filter when they call. We could do a ribbon/minute and send that series out, but you lose efficiency by forcing checks across e.g. 24*60 filters.

E: yes, self-confirming that's what's meant. This is a great low-maths explanation from downthread which helped - https://news.ycombinator.com/item?id=27800788


Correct. I've been going deep into this looking for a unicorn implementation friendly to distributed environments:

* Mutable - streaming applications

* Mergeable - so you can avoid querying N filters all the time, or share slices.

* Scalable - so you don't have to pre-size

* Performant

* Serialize to bytes

* Language - Implemented in multiple languages. Many are in C only which is suboptimal (and often too complicated to easily port).

* Bonus: Thread safe implementation

Typically you can get some of these, but not all. Sometimes it's just not possible with the structure (e.g. mutability in an xor filter); sometimes it's an implementation gap (serialization, etc.)
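On the mergeable point: plain bit-array Bloom filters with identical size and hash functions do merge losslessly with a bitwise OR, which is one reason I keep falling back to them. A sketch (parameters and hashing are illustrative, names are mine):

```python
import hashlib

M, K = 1 << 12, 3  # filter size in bits, probes per item (illustrative)


def probes(item):
    for i in range(K):
        d = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(d[:4], "big") % M


def build_filter(items):
    """Build one shard's filter as an M-bit integer."""
    bits = 0
    for item in items:
        for p in probes(item):
            bits |= 1 << p
    return bits


def might_contain(bits, item):
    return all((bits >> p) & 1 for p in probes(item))


def merge(a, b):
    # Same-size, same-hash Bloom filters merge by bitwise OR; the result
    # answers exactly as if it had been built from the union of both sets.
    return a | b
```

This is exactly what the xor/ribbon family gives up: their bits encode a solved linear system for one fixed key set, so two filters can't simply be OR-ed together.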

In my case, finding the magical combination in Java has been quite difficult. The only way was to go custom and relax some requirements (like auto-scaling structures).

After doing a lot of testing I ended up with pre-sized bloom filters behind an MPSC model. The tradeoff of having to monitor/pre-size filters is one I'm not happy with but weighed against other tradeoffs it made sense.

Cuckoo was far too slow (and merging was iffy). I got a huge speed increase by switching and customizing implementations (Guava -> spark -> fastfilter), and another big boost using XXH3 (or xxHash) rather than Murmur, which is pretty popular in off-the-shelf implementations.

Porting CQF, and much of the latest research was just too much (and many implementations lack merging, etc so you have to really understand the structure to do it right).


This appears to be FB’s implementation: https://github.com/facebook/rocksdb/blob/837705ad8011e249d5e...

GPLv2 or Apache 2.0


There are also the sources used for the benchmarks mentioned in the paper: https://github.com/pdillinger/fastfilter_cpp/tree/dev/src/ri...


The description sounds very complicated. I doubt this will ever fit into my brain, which would make me wary to ever use this. It seems more like a "voodoo data structure" like Judy arrays, where you have to trust some library and likely there's no practical way to debug it when it ever goes wrong. Perhaps it has nice properties, but it just seems risky to use something like this. In this case I don't even see a link to a library.


I agree with you that I'll probably never use this, at least not in the next few years. But I think it's worth putting the research in context-- they've only been published for 2 days now. Expecting a production-ready library is... unreasonable? The tooling, the documentation, the textbooks, the wiki page, etc aren't going to compare to Bloom filters, which have been used for over 40 years.


"However, we do not implement efficiency gains at all engineering costs, so it’s also important to have a user-friendly data structure. This issue stalled implementation of other Bloom alternatives offering some space savings. "

TFA article shows that they do share your worries, but they believe this approach is indeed simple enough to be worth it.


That’s a separate issue. The paper is concerned with the data structure not having gotchas (i.e. it is performant across a wide, continuous range of configuration and input values) whereas nn3 is concerned about personally understanding the design of the data structure.


The two aspects are intimately tied. A library that implements an algorithm that has few gotchas and can be used intuitively requires less troubleshooting/debugging and understanding.

Sure, you need to trust the authors. I'm sure we make such leaps of faith all the time with the various software components we use and surely don't review on a daily basis; not sure why this particular tool should be held to a different standard (provided it is indeed straightforward to operate).


I suspect they just refer to "black box practicality" here. As in they have a nice library for FB developers and the API is simple enough. And if something goes wrong the FB person can contact the author and they will fix it for them. I guess that's practicable enough for them.

I was looking more for "I can understand/implement/debug it myself" practicability, avoiding black boxes.

Even in the FB case there is the bus factor of course. When the author at some point moves on to greener pastures they can only hope that it still fits into the brain of whoever replaces him.


Careful, though. They don't even mention the original Bloom filter problem: the constant overhead of calculating 3-5 hashes per key, which is problematic with overly slow hash functions (like SipHash) or overlong keys. In those cases it's faster to go without a Bloom, xor, or ribbon filter, which only short-circuits searches for keys that are not included.

Or if you know that your key is always included in the dataset, it is pure overhead.


> TFA article

??


Yes A stands for Article, of course :-)


I skimmed the paper expecting to find a comparison to HyperLogLog, but didn’t find one - am I missing something?


A HyperLogLog is for counting distinct elements. This and Bloom filters are about checking whether an element has been seen before; a very different use case.


Cuckoo filter is the one I thought it would be compared to since I see mentioned on HN a lot: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

And the title seems to be a reference to it too, "Cuckoo Filters, Practically Better Than Bloom"


The paper has a great figure where they illustrate areas of the overhead vs false positive trade-off space where each filter type performs best. Cuckoo filters make an appearance there.


People: no need to downvote someone for asking a question.


> we expect Ribbon filters to save several percent of RAM resources, with a tiny increase in CPU usage for some major storage systems.

I would have factored in the code size with it too.


Depending on whether their CPU usage figures include waiting for memory accesses, any savings in RAM usage might be worth it: current systems do have a lot less memory bandwidth than CPU power, so it is preferable to compute more and load/store less. If their using less RAM also translates to using fewer loads/stores and more computations from registers (which they hint at), this optimisation might be a bigger win than it looks at first.


They're working with petabytes of data in RAM, terabytes per machine. Code size isn't significant by comparison.


I assume he meant code size as in how much a developer would need to read to understand.

Basically using code size as a proxy for code complexity

(which is not always the case of course)


Combining the binary size with the compressed data size is fairly typical when measuring compression schemes. Otherwise you end up in situations where people game benchmarks.

E.g., from [0]:

Compressors are ranked by the compressed size of enwik9 (10^9 bytes) plus the size of a zip archive containing the decompresser.

[0] http://www.mattmahoney.net/dc/text.html


Code size, if it causes iTLB misses, is a problem. If a program has lots of hot code sparsely distributed that’s an opportunity for optimization. On the other hand I expect Facebook to have a handle on this issue, in light of their BOLT and general maturity of their tool chain.


I have always been fascinated by our ability to make these sorts of tradeoffs, but I also worry that making this tradeoff in the first place is an intermediate compromise taken in favor of a more fundamental rework of a problem space.

If you find yourself rolling dice to figure out if something is in a set or not, you may want to step back and look at how you track membership and identity things.

Determinism (testability) is such a wonderful thing. Sure, you can make these filters appear deterministic in a controlled setting (i.e. a recorded log of system inputs), but there's no way you could tell me how the execution would flow in a real-world situation with live data, where you cannot anticipate future events.


Probabilistic data structures are used when certainty is not needed.

Bloom filters, for example, can serve as a pre-filter: having to work on only 50% of the data effectively speeds up your calculations by 100%.

If the Bloom filter rejects 50% of your misses, you got rid of those lookups to storage entirely.
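The arithmetic behind that: if a fraction m of your queries are misses and the filter has false-positive rate f, the fraction of all storage lookups eliminated is m * (1 - f). A trivial sketch (function name is mine):

```python
def lookups_avoided(miss_fraction, false_positive_rate):
    """Fraction of all lookups that skip storage: the true misses the
    filter correctly rejects (false positives still hit storage)."""
    return miss_fraction * (1 - false_positive_rate)


# e.g. half the queries are misses and the filter has a 1% false-positive
# rate: roughly 49.5% of storage lookups disappear.
```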

---------

In fact, most cryptography only works off of probability. Not only keys or RNG, but also on routine algorithms like finding a prime number.


> In fact, most cryptography only works off of probability. Not only keys or RNG, but also on routine algorithms like finding a prime number.

Usually in cryptography, when we assume our assumptions hold (e.g. there's no better method than semi-sophisticated brute force), you can get that probability down to a negligible number. And I recall learning that some key-generation systems even keep a list of known cracked large primes, and check that what they generate isn't in that list (in case the system's RNG is biased).
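"Finding a prime number" is a good example of driving the probability down to negligible: Miller-Rabin only ever declares a number *probably* prime, but each round at least quarters the error for a composite. A standard sketch:

```python
import random


def is_probable_prime(n, rounds=40):
    """Miller-Rabin primality test. A composite n fools any single round
    with probability < 1/4, so 40 rounds leave error below 4**-40."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True
```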


With a tiny bit of extra cost, a very expensive check can sometimes be omitted early.

Imagine your mailbox out in your yard has a slightly faulty indicator: Sometimes when it's ON, there's actually no mail. But you know for a fact that when it's OFF, you definitely have no mail. So you can look out your window to determine whether you might have mail.

If the indicator saves you more unnecessary trips to the mailbox than its false alarms cost you, it's a good thing to have overall.

That's what bloom filters do. They sometimes save you needless work. The fact that "sometimes" isn't deterministic doesn't detract from its usefulness.


The probability part of data structures like bloom filters are intrinsic, and are fundamental to the actual reason they are used.

At the set sizes they are useful for, you simply have to make trade offs for performance reasons, and the trade off you are making is explicit.

If you are using a bloom filter, you know upfront that false positives will happen. That is a fundamental part of the code you will write, it isn’t something you will discover at a later date. You will write tests for false positives. It is well understood what the probabilities are for false positives.

If you can’t live with that in your code, you might not be able to work on the sorts of workloads that bloom filters are good for.



