At some point, hashes are fast enough. I'm not going to be switching from xxhash to shave 40 picoseconds off of hashing small keys, and for large keys it's already a few orders of magnitude faster than my NVMe drive.
If whatever you are doing is heavy on hashmap lookups (and you are OK with not rewriting it into something more bespoke and complicated), then the faster the hash function and the cheaper the baseline cost of calling it, the better (e.g. XXH3 can be at a disadvantage here, since its popular implementation has to dispatch to the appropriate routine).
This looks like an interesting potential alternative to GxHash. GxHash is nice, but sadly I found AES intrinsics to have somewhat high latency on AMD's Zen 2/3 cores, making it a loss on short strings (the M4 Max sure had me fooled until I measured on our server hardware; it has far better AES latency despite retiring more AES operations!).
Without getting too deep in the technical weeds, I will assert that rapidhash is an excellent small-key hash function and approximately the state-of-the-art for that purpose. There is nothing else better that I am aware of in this scope.
For bulk hashing there are better functions, but XXH3, GXHash, etc are not among them, being both slower and weaker than the state-of-the-art. Hash functions that can’t pass quality tests are not serious contenders.
Can you share more about the quality tests XXH3 fails? I don't know this space deeply, so I'm having trouble reconciling why XXH3 is popular if it fails tests.
I feel like I'm missing an enormous amount of context here, and I can't be the only one.
How much better is it than old standbys like hashpjw, FNV1A, or djb's favorite¹, and how? It's apparently 350 lines of code instead of 1–10 lines of code. That seems like an important way that it's worse. And it comes with a copyright license that requires you to add it to your documentation, so the risk of forgetting that creates a risk of copyright infringement. So, how big is the benefit?
How much faster is it hashing a small key like the string "__umulh"? Is it actually slower? (For those needing an introduction to hash tables, covering why that question makes sense, Chris Wellons is always excellent: https://nullprogram.com/blog/2025/01/19/)
Why does it have an array commented "default secret parameters" in the public GitHub repo? Are we supposed to change those? Do they provide some kind of security, and if so, what kind, against what threat model?
The readme and https://scholar.google.com/scholar?hl=es&as_sdt=0%2C5&q=rapi... are no help here. https://github.com/rurban/smhasher?tab=readme-ov-file#summar... explains some of the questions at greater length but of course not rapidhash's particular answers.
______
¹ while (s[i]) h = h * 33 + s[i++]; about 11 machine-code instructions in its full-fledged form: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
I'm not sure I can answer your other questions like copyrights & secrets, but the "how is it faster?" part is actually pretty easy and worth knowing/saying for anyone unaware - the hash you present processes 1 byte at a time. All fast modern functions competing in this space try to do at least 8 bytes at a time, and perhaps a whole 64B cache line. They do tend to have high "startup/finalize" overheads, though, which can matter (can easily be dominant) for use in hash tables with modest key lengths, as opposed to hashing large files.
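To make that concrete, here is a rough C sketch of the two shapes - a classic byte-at-a-time FNV-1a next to an illustrative 8-bytes-at-a-time loop. The 64-bit constants and mixing in the second function are placeholders, not any real library's rounds:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Byte-at-a-time: classic FNV-1a, one xor + multiply per input byte. */
    uint64_t fnv1a(const void *key, size_t len) {
        const unsigned char *p = key;
        uint64_t h = 14695981039346656037ull;        /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ull;                    /* FNV prime */
        }
        return h;
    }

    /* Word-at-a-time sketch: consume 8 bytes per iteration via memcpy
       (compilers turn this into a single unaligned load), then mop up
       the 0..7 byte tail.  Constants and mixing are illustrative only. */
    uint64_t hash8_sketch(const void *key, size_t len) {
        const unsigned char *p = key;
        uint64_t h = 0x9e3779b97f4a7c15ull ^ len;     /* seed/length mix */
        uint64_t w;
        while (len >= 8) {
            memcpy(&w, p, 8);
            h = (h ^ w) * 0x9ddfea08eb382d69ull;
            h ^= h >> 29;
            p += 8;
            len -= 8;
        }
        w = 0;
        memcpy(&w, p, len);                           /* tail: 0..7 bytes */
        h = (h ^ w) * 0x9ddfea08eb382d69ull;
        return h ^ (h >> 32);
    }

Note how the word-at-a-time version already carries more fixed work outside the loop (the seed/length mix, the tail handling, a final mix) - that is exactly the startup/finalize overhead that can dominate for short hash-table keys.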
The rest of this comment is kind of ELI5 (I hope!) not because I think @kragen needs that, but because maybe other readers will benefit and I know he leans towards long form text, and @jandrewrogers basically alludes to this stuff in his comment opening this sub-thread. EDIT: as do a few other posters here like at least @curiouscoding https://news.ycombinator.com/item?id=44012740
______
As for benefit - that always depends on "Compared to what??", your use cases, what CPU you expect things to run on, and so much else. As just a few examples: last I knew, the SMHasher performance tests blocked inlining, which already makes them unrealistic comparisons, esp. with really tiny code-footprint functions like your FNV1a example. Your higher-level table structure might also be caching the hash code for table resizes/as a comparison prefix for lookups, so how much you have to hash at all may depend on doing that, predictability of table sizes, etc. How effective the comparison-prefix idea even is can depend upon how common long prefixes are in your key sets, which may be literally unknown - could be pathnames from `find` with tons of prefix similarity, or none at all.
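To illustrate the cached-hash-code/comparison-prefix idea, here is a hypothetical open-addressing entry layout (not any particular library's; names and the empty-slot convention are made up for the sketch):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <stdbool.h>

    /* Hypothetical open-addressing entry: the full 64-bit hash is stored
       alongside the key so that (a) a table resize never re-reads the key
       bytes and (b) lookups can reject mismatches by comparing 8 bytes
       before ever touching possibly long key data. */
    struct entry {
        uint64_t hash;        /* cached hash code; 0 = empty (illustrative) */
        const char *key;
        size_t key_len;
        void *value;
    };

    static bool entry_matches(const struct entry *e, uint64_t h,
                              const char *key, size_t key_len) {
        return e->hash == h                        /* cheap 8-byte prefilter */
            && e->key_len == key_len
            && memcmp(e->key, key, key_len) == 0;  /* only now touch the key */
    }

With that layout, a resize only re-buckets by e->hash and never hashes or compares key bytes at all.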
Furthermore, the entire culture of presenting performance in this space is sadly "rate" - the reciprocal of what is easier to reason about, namely "time". Why time? Time is additive. Even the headline here is "GB/s"; cycles/byte or ns/byte is just easier. You want to present a time equation, perHashCost + perByteCost*nBytes, and both unknown coefficients, if inferred from real times on a modern time-sharing OS, probably have significant statistical variation. If there is a highly non-random distribution of kicks/competition/interference, it will usually be easier to see as kicks in the time data; the reciprocal transformation distorts everything. Anyway, a time cost equation (like Knuth v1-3 of days of yore!) also lets you easily compute the break-even string length where the rate term starts to dominate fixed overheads, namely nBytes_breakEven = perHashCost/perByteCost.
It also lets you estimate an answer to your question of "when does this matter": if your strings are typically under 32B (I believe an early-days limit on identifiers for many C compilers, but someone can correct me if that's wrong) and the perByteCost is 0.1 ns/byte (10 GB/s - just a round number), then break-even (when the per-hash cost starts to matter more) is at a perHashCost of 0.1*32 = 3.2 ns, or 16..24 cpu cycles on most modern deployments. Said the other way, if your perHash overhead is 16..24 cycles and your hash throughput is 10 GB/s, then you'd better hope your strings are longer than 32 bytes on average, or you've probably made a bad choice.
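A tiny, self-contained version of that arithmetic, using only the round numbers assumed above (they are illustrative, not measurements):

    #include <stdio.h>

    /* Simple time model: t(n) = perHashCost + perByteCost * n  (nanoseconds). */
    static double hash_time_ns(double per_hash_ns, double per_byte_ns, double n_bytes) {
        return per_hash_ns + per_byte_ns * n_bytes;
    }

    int main(void) {
        double per_hash_ns = 3.2;   /* assumed fixed setup/finalize cost   */
        double per_byte_ns = 0.1;   /* 0.1 ns/byte == 10 GB/s (assumed)    */

        /* Break-even length: where the streaming term equals the fixed term. */
        printf("break-even length: %.0f bytes\n", per_hash_ns / per_byte_ns);   /* 32 */

        /* Cost for a short identifier vs. a longer key under this model. */
        printf("t(7)   = %.2f ns\n", hash_time_ns(per_hash_ns, per_byte_ns, 7));
        printf("t(256) = %.2f ns\n", hash_time_ns(per_hash_ns, per_byte_ns, 256));
        return 0;
    }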
One way in which an equation fails is that it does oversimplify the performance profile even of this simple problem. That profile often has plateaus/cliffs at cache line size and integer size boundaries, but rates fare no better there. Even so, if you are presenting a bunch of hashes, a table sortable by perHash and perByte costs would be the best starting point for most evaluations where the slope is "just an average"/effective value. A plot is best, though.
It's not the fastest on smhasher anymore; umash is better and faster, and gxhash is twice as fast. It is, however, very hashmap-friendly because of its small size. It's a wyhash variant.
Hey Reini, the rapidhash version on the SMHasher repo is outdated.
The latest version was published last Tuesday, I'll submit it to SMHasher shortly.
This new version is much faster than the previous one, and still passes all tests. It should beat umash on your benchmarks.
I'd love to see some benchmarks/comparisons on variable-length strings. For strings with random lengths between 10 and 30, gxhash was significantly faster than xxhash for me. I assume that's because it processes chunks of up to 32 chars at a time, avoiding branch misses.
Generally my feeling is that at these speeds, designing for branch-predictability for short strings might be more important than absolute throughput.
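One common way to get that branch-predictability is to cover a whole range of lengths with a fixed number of possibly-overlapping loads, so the short-key path runs the same instructions regardless of exact length. As far as I know the wyhash/rapidhash family does something along these lines for short keys; the sketch below only shows the idea, with placeholder mixing:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    static uint64_t load64(const unsigned char *p) { uint64_t v; memcpy(&v, p, 8); return v; }
    static uint32_t load32(const unsigned char *p) { uint32_t v; memcpy(&v, p, 4); return v; }

    /* Short-key path only (keys longer than 16 bytes would go to a bulk
       path, omitted here).  Every length in 4..16 is handled with exactly
       two possibly-overlapping loads, so there is no per-byte loop and no
       length-dependent branch pattern to mispredict.  The final mix is a
       placeholder, not rapidhash's/wyhash's actual rounds. */
    uint64_t short_key_hash(const void *key, size_t len, uint64_t seed) {
        const unsigned char *p = key;
        uint64_t a, b;
        if (len >= 8) {            /* 8..16 bytes: two overlapping 64-bit loads */
            a = load64(p);
            b = load64(p + len - 8);
        } else if (len >= 4) {     /* 4..7 bytes: two overlapping 32-bit loads */
            a = load32(p);
            b = load32(p + len - 4);
        } else {                   /* 0..3 bytes: gather up to three bytes */
            a = len ? ((uint64_t)p[0] << 16) | ((uint64_t)p[len >> 1] << 8) | p[len - 1] : 0;
            b = 0;
        }
        uint64_t h = (a ^ 0x2d358dccaa6c78a5ull) * (b ^ seed ^ len);
        h ^= h >> 32;
        return h * 0x8bb84b93962eacc9ull;
    }

The only control flow left is the branch on the length class, which is far friendlier to the predictor than a data-dependent byte loop.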
_Fitting_ in the instruction cache isn't hard, but you'd also ideally let there be room for some other code as well :-) For a hash map lookup, where the hashing is frequently inlined a couple of times, code size matters.
You'll probably end up fitting entirely inside the reorder buffer plus a sequential stream from memory, with the actual caches almost irrelevant.
Yes, I'm talking about the data cache.
> But there are ways to make more or less efficient use of the data cache.
How?
You need to touch every byte of input, and nothing should be faster than going right through from start to end.
This is a minor example, but since you asked...
https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L6432
That's an example of a fair number of accumulators that are stored as XXHash goes through its input buffer.
Many modern hash functions store more state/accumulators than they used to. Previous generations of hash functions would often just have one or two accumulators and run through the data. Many modern hash functions might even store multiple wider SIMD variables for better mixing.
And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
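A stripped-down, scalar illustration of that several-accumulators shape (real implementations such as XXH3 use SIMD lanes and much better mixing; the constants below are just familiar-looking placeholders):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Four independent 64-bit accumulators, each eating one 8-byte lane of
       a 32-byte stripe per iteration.  Independent chains keep several
       multiplies in flight at once instead of serializing on a single
       accumulator.  Tail handling and final mixing are omitted/simplified. */
    uint64_t bulk_hash_sketch(const void *key, size_t len) {
        const unsigned char *p = key;
        uint64_t acc[4] = { 0x9e3779b185ebca87ull, 0xc2b2ae3d27d4eb4full,
                            0x165667b19e3779f9ull, 0x85ebca77c2b2ae63ull };
        while (len >= 32) {
            for (int i = 0; i < 4; i++) {
                uint64_t lane;
                memcpy(&lane, p + 8 * i, 8);
                acc[i] = (acc[i] ^ lane) * 0x9ddfea08eb382d69ull;
                acc[i] ^= acc[i] >> 31;
            }
            p += 32;
            len -= 32;
        }
        /* (A real function would also fold in the remaining 0..31 bytes.) */
        return acc[0] ^ acc[1] ^ acc[2] ^ acc[3] ^ len;
    }

Four 64-bit accumulators still fit comfortably in registers; spilling state to the data cache only becomes a question with substantially more state, e.g. many wide SIMD accumulators.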
> And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
And there are 150+ registers in the actual chip.
But my argument is more that there isn't really an efficient or inefficient way to use L1. So unless you have an enormous amount of state, the question is moot. And if you have so much state you're spilling to L2, that's not when you worry about good or bad cache use, that's a weird bloat problem.