At some point, hashes are fast enough. I'm not going to be switching from xxhash to shave 40 picoseconds off of hashing small keys, and for large keys it's already a few orders of magnitude faster than my NVMe drive.
If whatever you are doing is heavy on hashmap lookups (and you are OK with not rewriting it into something more bespoke and complicated), then the faster the hash function and the cheaper the baseline cost of calling it, the better (e.g. XXH3 can be at a disadvantage here, since its popular implementation has to dispatch to the appropriate routine).
This looks like an interesting potential alternative to GxHash. GxHash is nice, but sadly I found AES intrinsics to have somewhat high latency on AMD's Zen 2/3 cores, making it a loss on short strings (the M4 Max sure had me fooled until I measured on our server hardware; it has far better AES latency despite retiring more AES operations!).
Without getting too deep in the technical weeds, I will assert that rapidhash is an excellent small-key hash function and approximately the state-of-the-art for that purpose. There is nothing else better that I am aware of in this scope.
For bulk hashing there are better functions, but XXH3, GXHash, etc are not among them, being both slower and weaker than the state-of-the-art. Hash functions that can’t pass quality tests are not serious contenders.
Can you share more about the quality tests XXH3 fails? I don't know this space deeply, so I'm having trouble reconciling why XXH3 is popular if it fails tests.
I feel like I'm missing an enormous amount of context here, and I can't be the only one.
How much better is it than old standbys like hashpjw, FNV1A, or djb's favorite¹, and how? It's apparently 350 lines of code instead of 1–10 lines of code. That seems like an important way that it's worse. And it comes with a copyright license that requires you to add it to your documentation, so the risk of forgetting that creates a risk of copyright infringement. So, how big is the benefit?
How much faster is it hashing a small key like the string "__umulh"? Is it actually slower? (For those needing an introduction to hash tables, covering why that question makes sense, Chris Wellons is always excellent: https://nullprogram.com/blog/2025/01/19/)
Why does it have an array commented "default secret parameters" in the public GitHub repo? Are we supposed to change those? Do they provide some kind of security, and if so, what kind, against what threat model?
The readme and https://scholar.google.com/scholar?hl=es&as_sdt=0%2C5&q=rapi... are no help here. https://github.com/rurban/smhasher?tab=readme-ov-file#summar... explains some of the questions at greater length but of course not rapidhash's particular answers.
______
¹ while (s[i]) h = h * 33 + s[i++]; about 11 machine-code instructions in its full-fledged form: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
I'm not sure I can answer your other questions like copyrights & secrets, but the "how is it faster?" part is actually pretty easy and worth knowing/saying for anyone unaware - the hash you present processes 1 byte at a time. All fast modern functions competing in this space try to do at least 8 bytes at a time, and perhaps a whole 64B cache line. They do tend to have high "startup/finalize" overheads, though, which can matter (can easily be dominant) for use in hash tables with modest key lengths, as opposed to hashing large files.
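To make that concrete, here is a rough C sketch of the two shapes - a classic byte-at-a-time FNV-1a next to an illustrative 8-bytes-at-a-time loop. The 64-bit constants and mixing in the second function are placeholders, not any real library's rounds:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Byte-at-a-time: classic FNV-1a, one xor + multiply per input byte. */
    uint64_t fnv1a(const void *key, size_t len) {
        const unsigned char *p = key;
        uint64_t h = 14695981039346656037ull;        /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            h ^= p[i];
            h *= 1099511628211ull;                    /* FNV prime */
        }
        return h;
    }

    /* Word-at-a-time sketch: consume 8 bytes per iteration via memcpy
       (compilers turn this into a single unaligned load), then mop up
       the 0..7 byte tail.  Constants and mixing are illustrative only. */
    uint64_t hash8_sketch(const void *key, size_t len) {
        const unsigned char *p = key;
        uint64_t h = 0x9e3779b97f4a7c15ull ^ len;     /* seed/length mix */
        uint64_t w;
        while (len >= 8) {
            memcpy(&w, p, 8);
            h = (h ^ w) * 0x9ddfea08eb382d69ull;
            h ^= h >> 29;
            p += 8;
            len -= 8;
        }
        w = 0;
        memcpy(&w, p, len);                           /* tail: 0..7 bytes */
        h = (h ^ w) * 0x9ddfea08eb382d69ull;
        return h ^ (h >> 32);
    }

Note how the word-at-a-time version already carries more fixed work outside the loop (the seed/length mix, the tail handling, a final mix) - that is exactly the startup/finalize overhead that can dominate for short hash-table keys.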
The rest of this comment is kind of ELI5 (I hope!) not because I think @kragen needs that, but because maybe other readers will benefit and I know he leans towards long form text, and @jandrewrogers basically alludes to this stuff in his comment opening this sub-thread. EDIT: as do a few other posters here like at least @curiouscoding https://news.ycombinator.com/item?id=44012740
______
As for benefit - that always depends on "Compared to what??", your use cases, what CPU you expect things to run on, and so much else. As just a few examples: last I knew, the SMHasher performance tests blocked inlining, which already makes them unrealistic comparisons, esp. with really tiny code-footprint functions like your FNV1a example. Your higher-level table structure might also be caching the hash code for table resizes/as a comparison prefix for lookups, so how much you have to hash at all may depend on doing that, predictability of table sizes, etc. How effective the comparison-prefix idea even is can depend upon how common long prefixes are in your key sets, which may be literally unknown - could be pathnames from `find` with tons of prefix similarity, or none at all.
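To illustrate the cached-hash-code/comparison-prefix idea, here is a hypothetical open-addressing entry layout (not any particular library's; names and the empty-slot convention are made up for the sketch):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <stdbool.h>

    /* Hypothetical open-addressing entry: the full 64-bit hash is stored
       alongside the key so that (a) a table resize never re-reads the key
       bytes and (b) lookups can reject mismatches by comparing 8 bytes
       before ever touching possibly long key data. */
    struct entry {
        uint64_t hash;        /* cached hash code; 0 = empty (illustrative) */
        const char *key;
        size_t key_len;
        void *value;
    };

    static bool entry_matches(const struct entry *e, uint64_t h,
                              const char *key, size_t key_len) {
        return e->hash == h                        /* cheap 8-byte prefilter */
            && e->key_len == key_len
            && memcmp(e->key, key, key_len) == 0;  /* only now touch the key */
    }

With that layout, a resize only re-buckets by e->hash and never hashes or compares key bytes at all.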
Furthermore, the entire culture of presenting performance in this space is sadly "rate" - the reciprocal of what is easier to reason about, namely "time". Why time? Time is additive. Even the headline here is "GB/s"; cycles/byte or ns/byte is just easier. You want to present a time equation, perHashCost + perByteCost*nBytes, and both unknown coefficients, if inferred from real times on a modern time-sharing OS, probably have significant statistical variation. If there is a highly non-random distribution of kicks/competition/interference, it will usually be easier to see as kicks in the time data; the reciprocal transformation distorts everything. Anyway, a time cost equation (like Knuth v1-3 of days of yore!) also lets you easily compute the break-even string length where the rate term starts to dominate fixed overheads, namely nBytes_breakEven = perHashCost/perByteCost.
It also lets you estimate an answer to your question of "when does this matter": if your strings are typically under 32B (I believe an early-days limit on identifiers for many C compilers, but someone can correct me if that's wrong) and the perByteCost is 0.1 ns/byte (10 GB/s - just a round number), then break-even (when the per-hash cost starts to matter more) is at a perHashCost of 0.1*32 = 3.2 ns, or 16..24 cpu cycles on most modern deployments. Said the other way, if your perHash overhead is 16..24 cycles and your hash throughput is 10 GB/s, then you'd better hope your strings are longer than 32 bytes on average, or you've probably made a bad choice.
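A tiny, self-contained version of that arithmetic, using only the round numbers assumed above (they are illustrative, not measurements):

    #include <stdio.h>

    /* Simple time model: t(n) = perHashCost + perByteCost * n  (nanoseconds). */
    static double hash_time_ns(double per_hash_ns, double per_byte_ns, double n_bytes) {
        return per_hash_ns + per_byte_ns * n_bytes;
    }

    int main(void) {
        double per_hash_ns = 3.2;   /* assumed fixed setup/finalize cost   */
        double per_byte_ns = 0.1;   /* 0.1 ns/byte == 10 GB/s (assumed)    */

        /* Break-even length: where the streaming term equals the fixed term. */
        printf("break-even length: %.0f bytes\n", per_hash_ns / per_byte_ns);   /* 32 */

        /* Cost for a short identifier vs. a longer key under this model. */
        printf("t(7)   = %.2f ns\n", hash_time_ns(per_hash_ns, per_byte_ns, 7));
        printf("t(256) = %.2f ns\n", hash_time_ns(per_hash_ns, per_byte_ns, 256));
        return 0;
    }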
One way in which an equation fails is that it does oversimplify the performance profile even of this simple problem. That profile often has plateaus/cliffs at cache line size and integer size boundaries, but rates fare no better there. Even so, if you are presenting a bunch of hashes, a table sortable by perHash and perByte costs would be the best starting point for most evaluations where the slope is "just an average"/effective value. A plot is best, though.
It's not the fastest on smhasher anymore; umash is better and faster, and gxhash is twice as fast. It is, however, very hashmap-friendly because of its small size. It's a wyhash variant.
Hey Reini, the rapidhash version on the SMHasher repo is outdated.
The latest version was published last Tuesday, I'll submit it to SMHasher shortly.
This new version is much faster than the previous one, and still passes all tests. It should beat umash on your benchmarks.
I'd love to see some benchmarks/comparisons on variable-length strings. For strings with random lengths between 10 and 30, gxhash was significantly faster than xxhash for me. I assume that's because it processes chunks of up to 32 chars at a time, avoiding branch misses.
Generally my feeling is that at these speeds, designing for branch-predictability for short strings might be more important than absolute throughput.
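One common way to get that branch-predictability is to cover a whole range of lengths with a fixed number of possibly-overlapping loads, so the short-key path runs the same instructions regardless of exact length. As far as I know the wyhash/rapidhash family does something along these lines for short keys; the sketch below only shows the idea, with placeholder mixing:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    static uint64_t load64(const unsigned char *p) { uint64_t v; memcpy(&v, p, 8); return v; }
    static uint32_t load32(const unsigned char *p) { uint32_t v; memcpy(&v, p, 4); return v; }

    /* Short-key path only (keys longer than 16 bytes would go to a bulk
       path, omitted here).  Every length in 4..16 is handled with exactly
       two possibly-overlapping loads, so there is no per-byte loop and no
       length-dependent branch pattern to mispredict.  The final mix is a
       placeholder, not rapidhash's/wyhash's actual rounds. */
    uint64_t short_key_hash(const void *key, size_t len, uint64_t seed) {
        const unsigned char *p = key;
        uint64_t a, b;
        if (len >= 8) {            /* 8..16 bytes: two overlapping 64-bit loads */
            a = load64(p);
            b = load64(p + len - 8);
        } else if (len >= 4) {     /* 4..7 bytes: two overlapping 32-bit loads */
            a = load32(p);
            b = load32(p + len - 4);
        } else {                   /* 0..3 bytes: gather up to three bytes */
            a = len ? ((uint64_t)p[0] << 16) | ((uint64_t)p[len >> 1] << 8) | p[len - 1] : 0;
            b = 0;
        }
        uint64_t h = (a ^ 0x2d358dccaa6c78a5ull) * (b ^ seed ^ len);
        h ^= h >> 32;
        return h * 0x8bb84b93962eacc9ull;
    }

The only control flow left is the branch on the length class, which is far friendlier to the predictor than a data-dependent byte loop.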
_Fitting_ in the instruction cache isn't hard, but you'd also ideally let there be room for some other code as well :-) For a hash map lookup, where the hashing is frequently inlined a couple of times, code size matters.
You'll probably end up fitting entirely inside the reorder buffer plus a sequential stream from memory, with the actual caches almost irrelevant.
Yes, I'm talking about the data cache.
> But there are ways to make more or less efficient use of the data cache.
How?
You need to touch every byte of input, and nothing should be faster than going right through from start to end.
This is a minor example, but since you asked...
https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L6432
That's an example of a fair number of accumulators that are stored as XXHash goes through its input buffer.
Many modern hash functions store more state/accumulators than they used to. Previous generations of hash functions would often just have one or two accumulators and run through the data. Many modern hash functions might even store multiple wider SIMD variables for better mixing.
And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
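A stripped-down, scalar illustration of that several-accumulators shape (real implementations such as XXH3 use SIMD lanes and much better mixing; the constants below are just familiar-looking placeholders):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Four independent 64-bit accumulators, each eating one 8-byte lane of
       a 32-byte stripe per iteration.  Independent chains keep several
       multiplies in flight at once instead of serializing on a single
       accumulator.  Tail handling and final mixing are omitted/simplified. */
    uint64_t bulk_hash_sketch(const void *key, size_t len) {
        const unsigned char *p = key;
        uint64_t acc[4] = { 0x9e3779b185ebca87ull, 0xc2b2ae3d27d4eb4full,
                            0x165667b19e3779f9ull, 0x85ebca77c2b2ae63ull };
        while (len >= 32) {
            for (int i = 0; i < 4; i++) {
                uint64_t lane;
                memcpy(&lane, p + 8 * i, 8);
                acc[i] = (acc[i] ^ lane) * 0x9ddfea08eb382d69ull;
                acc[i] ^= acc[i] >> 31;
            }
            p += 32;
            len -= 32;
        }
        /* (A real function would also fold in the remaining 0..31 bytes.) */
        return acc[0] ^ acc[1] ^ acc[2] ^ acc[3] ^ len;
    }

Four 64-bit accumulators still fit comfortably in registers; spilling state to the data cache only becomes a question with substantially more state, e.g. many wide SIMD accumulators.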
> And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
And there are 150+ registers in the actual chip.
But my argument is more that there isn't really an efficient or inefficient way to use L1. So unless you have an enormous amount of state, the question is moot. And if you have so much state you're spilling to L2, that's not when you worry about good or bad cache use, that's a weird bloat problem.