Perfect Hashes and faster than memcmp (perl.org)
86 points by dmit on March 18, 2015 | 18 comments


If anyone is interested, I have a cute nearly-minimal perfect hashing algorithm designed to have good cache-friendly properties. It works very well in practice and in my particular application was performing faster and more consistently (no long tail) than anything found in CMPH.

Briefly, it is somewhat similar to hopscotch (or robin-hood) hashing, except that you pre-calculate the positions of the elements, putting them into optimal spots by solving the assignment problem (via a Hungarian assignment solver or similar). It works for up to about 50k elements. It feels like it might have good theoretical properties, maybe even be optimal (after all, we're essentially calculating optimal positions for the elements based on the costs of our memory/cache accesses and the probabilities of each element), but it's been a while since I took an algorithms class.
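To make the idea concrete, here is a minimal sketch in C. The names (`home_slot`, `place_cost`, `assign`), the cost numbers, and the table sizes are all hypothetical, and a simple greedy pass stands in for the Hungarian solver the comment describes, so this only approximates the optimal assignment:

```c
#include <stdint.h>

#define NSLOTS 16
#define WINDOW 4          /* how far a key may land from its home slot */
#define INF    1000000

/* toy multiplicative hash; a real table would use a stronger function */
static unsigned home_slot(uint32_t key) { return (key * 2654435761u) % NSLOTS; }

/* cost of placing a key d slots past its home slot; in the real scheme
   this would model cache-line boundaries and key probabilities */
static int place_cost(unsigned home, unsigned slot) {
    unsigned d = (slot + NSLOTS - home) % NSLOTS;
    if (d >= WINDOW) return INF;      /* outside the allowed window */
    return (int)d;
}

/* Greedy stand-in for the assignment solver: repeatedly take the
   cheapest still-available (key, slot) pair.  A Hungarian solver over
   the same cost matrix would give the true optimum. */
static int assign(const uint32_t *keys, int n, int *slot_of /* out */) {
    int used[NSLOTS] = {0};
    int done = 0;
    for (int i = 0; i < n; i++) slot_of[i] = -1;
    while (done < n) {
        int best_k = -1, best_s = -1, best_c = INF;
        for (int k = 0; k < n; k++) {
            if (slot_of[k] >= 0) continue;
            unsigned h = home_slot(keys[k]);
            for (unsigned s = 0; s < NSLOTS; s++) {
                if (used[s]) continue;
                int c = place_cost(h, s);
                if (c < best_c) { best_c = c; best_k = k; best_s = s; }
            }
        }
        if (best_k < 0) return -1;    /* no feasible placement */
        slot_of[best_k] = best_s;
        used[best_s] = 1;
        done++;
    }
    return 0;
}
```

Once the assignment is computed offline, lookups probe only the precomputed slot (plus at most the small window), which is what gives the scheme its cache-friendly behavior.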

If anyone is interested in doing a writeup and publishing clean source code, you'd be welcome; ping me via e-mail.


Your email address isn't listed in your profile (you might have entered it in the "email" box, but that's not visible to other people).


It is dchichkov at gmail...


I'm interested. Robin-Hood optimization was on my TODO list. Dmitry Chichkov from XAMBALA? I couldn't find your email.


we can optimize memcmp away by doing word-size comparison directly. This is 50% - 2000% faster than doing memcmp on each string

This is what a good implementation of memcmp would do anyway (and on x86 one can compare up to 16 bytes at once using SSE), but the problem is that memcmp is generic: it must work for all lengths, so for short comparisons you lose more cycles switching between all the special cases than you gain from the wide comparison itself.
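The word-size trick is straightforward when the key is a compile-time constant of known length. A minimal sketch (the function name is made up for illustration; the caller must guarantee at least 8 readable bytes at `s`):

```c
#include <stdint.h>
#include <string.h>

/* Compare an 8-byte region against the constant key "getline"
   (7 chars + NUL = 8 bytes) in one word-sized comparison instead of a
   byte-by-byte memcmp.  memcpy is the portable way to express an
   unaligned load; compilers turn it into a single mov. */
static int is_key_getline(const char *s) {
    uint64_t w, k;
    memcpy(&w, s, 8);             /* one 64-bit load of the input */
    memcpy(&k, "getline", 8);     /* constant, folded at compile time */
    return w == k;
}
```

Because the key and its length are known at compile time, there is no length dispatch at all: the whole comparison compiles down to a load, a compare against an immediate, and a branch.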


For the love of FSM, please don't call your library "phash". "phash" is already in wide use for "perceptual hash", or the hashing mechanisms used for things like fuzzy image searching.


Oh, I am on HN :)

Okay, point taken. I didn't know about this name. "pperf" maybe then. Note that this is still work in progress, I just had to finish another project, and will come back to this one soon.

EDIT: I've renamed the frontend to "pperf" now https://github.com/rurban/Perfect-Hash/commit/b475886f37d9

But the memcmp trick of unrolling the comparisons beforehand for constant keys is already usable, and I believe someone has already considered that optimization for gcc and llvm. If I remember correctly, I even saw a patch. It should be used for switch statements with constant keys.


I think this is the first time I've seen a person on HN be told "oh a product already exists with this name, maybe change it" and they actually have.

Props :)


One caveat of perfect hash functions, including CMPH, is that for a very large number of keys the data structures they use internally become very large. In the case of CMPH, when I tried it with around 30 million keys the result was several MBs, which did not fit in L3 cache. For large dictionaries, ordinary fast hash functions like xxhash and farmhash are usually a better option than perfect hash functions.


You can combine perfect hashing with string compression (Huffman, prefix tries) and de-duplication techniques to vastly improve memory usage for a wide variety of use cases. DiscoDB does this to good effect (http://discodb.readthedocs.org/en/latest/).


If you use a fast hash function you still have to look the key up in a hash table, so the memory access is still there. With a clever probing strategy you might average only one cache-line access per lookup, while most perfect hash functions, which use the MWHC scheme, do three random accesses.

However, these three random accesses are independent, so the CPU can mostly pipeline them, and the perceived latency is that of a single random access.
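The shape of an MWHC-style lookup makes the independence visible: three probes into a precomputed `g[]` array, none depending on another, so the CPU can issue all three loads in parallel. A toy sketch (the hash function, table size, and the all-zero `g[]` are placeholders; a real `g[]` comes from the MWHC construction phase):

```c
#include <stdint.h>

#define R 8   /* size of each of the three g[] segments (toy value) */

/* Toy seeded string hash (FNV-1a variant); a real MWHC build uses
   stronger, independently seeded functions. */
static unsigned seeded_hash(const char *key, uint32_t seed) {
    uint32_t x = 2166136261u ^ seed;
    for (; *key; key++) { x ^= (uint8_t)*key; x *= 16777619u; }
    return x;
}

/* g[] is normally produced by the construction (peeling a 3-hypergraph);
   the zeros here are placeholders just to make the lookup shape concrete. */
static const uint8_t g[3 * R] = {0};

/* The lookup: three independent array probes, summed mod the table size.
   Because none of the three loads depends on another, the CPU pipelines
   them and the observed latency is close to a single cache miss. */
static unsigned mwhc_lookup(const char *key) {
    unsigned h0 = seeded_hash(key, 1) % R;           /* first segment  */
    unsigned h1 = R + seeded_hash(key, 2) % R;       /* second segment */
    unsigned h2 = 2 * R + seeded_hash(key, 3) % R;   /* third segment  */
    return (g[h0] + g[h1] + g[h2]) % (3 * R);
}
```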


Thanks, I know. I used one of the CMPH algorithms for the first three default methods (from pure Perl to pure C): -hanov, -hanovpp, and -urban, and those perform much better than the original CMPH. And I haven't enabled compression yet.

But I still haven't found the best overall method yet. Hashing all keys shouldn't be necessary. There will come more later.


I don't know much about the topic but I was under the impression a minimal 1 to 1 mapping is usually possible. ref: http://www.burtleburtle.net/bob/hash/perfect.html


cmph is not the fastest library for MPHFs, mostly because it uses a slow ranking function to turn a PHF into an MPHF.

I wrote a small library some time ago that performs much faster, for both in-memory construction and lookups [1; see the linked paper for benchmarks]. Unfortunately I have no time to maintain it, but I recently found out that some projects are using it, so it wasn't all wasted time :)

[1] https://github.com/ot/emphf

</shameless plug>


Excellent. I'll take it :)


i still fail to see the enormous fuss about hash tables as if they are some revelatory data structure.

the special cases where you can construct perfect hashes are normally amenable to setting some fixed index 'hashes' for each item and storing them in an array, then never referring to their values for lookups in code, but always using the indices.

in heavily compiled languages like C and C++, when you know everything at compile-time then the run-time should become zero. anything more is a failure to use the language to sufficiently inform the compiler of what is happening.
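The "fixed indices instead of hashes" idea above can be sketched in a few lines of C. The keyword set here is hypothetical; the point is that once every key is an enum constant, a "lookup" by key disappears at run time entirely:

```c
/* When the key set is fixed at compile time, the "hash" can just be an
   enum index: code only ever passes the index around, so nothing is
   looked up by string at run time.  Hypothetical key set. */
enum keyword { KW_IF, KW_ELSE, KW_WHILE, KW_COUNT };

static const char *keyword_name[KW_COUNT] = {
    [KW_IF] = "if", [KW_ELSE] = "else", [KW_WHILE] = "while",
};

/* the entire run-time cost of "looking up" a known key: one array index */
static const char *name_of(enum keyword k) { return keyword_name[k]; }
```

Going the other direction (from a string read at run time back to the index) is the one case where this breaks down, and that is exactly where a perfect hash or a switch over packed tags comes back in.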

i suspect this is why perfect hashes do not see enormous widespread use: there is a solution for the most common use cases with the exceptional run-time complexity of O(0).

this is enormously faster than the necessarily more complicated perfect hash functions, and only falls down when you have a run-time requirement to do the lookup from the value instead of the key... which in my experience goes hand in hand with not knowing your data until run-time, which rules out a perfect hashing function.

in most run-time cases a binary tree is good enough; it's very rare that you need better, and even when you do, the optimisation can be found by tuning that structure to pool-allocate nodes or similar. the usual array-of-linked-lists approach can be better for small data sets too... but a perfect hash function isn't even applicable.

all that being said, it is the right solution in those exceptionally rare cases where you know your data ahead of time, but need to look it up by value, rather than hash, at run-time.


Constrained input that is processed according to what the program knows about each "key" it contains, with keys defined in advance and effectively specifying a data structure or file format, is actually very common. Examples: HTML, XML and similar kinds of text markup; tagged file formats, e.g. PNG; the traditional application to programming language keywords.


right, that's a good use case, although i'm not sure it's terribly important. the slow part of processing a png file is not identifying its contents... although this may be different with XML and HTML parsers. i'm still not sure i'd call it common, or even desirable.

for lots of chunk-based formats, using a switch statement on the chunk-ids is clean and gives the optimisation task to the compiler, which tends to do a pretty good job of these things... even if it generates a gigantic conditional statement. since these sorts of ids are 4 or 8 bytes long, the 'memcmp is slow' argument falls apart
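The switch-on-chunk-id pattern can be sketched by packing the 4-byte tag into an integer so it can appear as a case label. The dispatcher and its return codes are made up for illustration; the chunk names are real PNG chunk types:

```c
#include <stdint.h>

/* Pack a 4-char chunk tag into an integer constant expression, so it
   can be used as a switch case label. */
#define TAG(a,b,c,d) ((uint32_t)(uint8_t)(a)       | \
                      (uint32_t)(uint8_t)(b) << 8  | \
                      (uint32_t)(uint8_t)(c) << 16 | \
                      (uint32_t)(uint8_t)(d) << 24)

/* Byte-wise pack of the input tag; endian-independent, and compilers
   lower it to a single 32-bit load on little-endian targets. */
static uint32_t load_tag(const char *p) {
    return TAG(p[0], p[1], p[2], p[3]);
}

/* Hypothetical dispatcher over a few real PNG chunk types; the compiler
   turns the switch into a jump table or a short compare chain, with no
   memcmp anywhere. */
static int classify_chunk(const char *p) {
    switch (load_tag(p)) {
    case TAG('I','H','D','R'): return 0;
    case TAG('I','D','A','T'): return 1;
    case TAG('I','E','N','D'): return 2;
    default:                   return -1;
    }
}
```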

in fact this sort of switch statement usage is such a good optimisation here that it is used by this perfect hash mechanism itself to solve its own problem, effectively offloading some of the cleverness requirement to the compiler...

also, in the context of programming languages this problem falls into the category i described of being beatable at compile-time. usually this is implemented by tokenising (hashing pieces of) all the string input then only ever using the tokens (hashes, indices) outside of that context. whatever lookups need doing can then become indexing into a nice flat array defined at compile-time.

in the other cases where lookups are needed they are for things like variables and constants, which are not known at compile-time, and the perfect hash can not help...

you might also want hash collisions in cases where languages have aliases for keywords

sure, you could lift some of the performance cost of the tokenisation itself outside of run-time with that, but tokenising programming language input is not the dominant cost of compiling code.




