Hi Florian,
> I encountered an issue with the way CLN computes hashes of small integers.
> The following test program computes the hashes of the integers 0-49:
>
> #include <iostream>
> #include <cln/integer.h>
>
> int main() {
>     for (int i = 0; i < 50; i++)
>         std::cout << "hash(" << i << ") = "
>                   << cln::equal_hashcode(cln::cl_I(i)) << std::endl;
>     return 0;
> }
> The output is
>
> hash(0) = 0
> hash(1) = 65
> hash(2) = 66
> hash(3) = 98
> hash(4) = 67
> hash(5) = 83
> hash(6) = 99
> hash(7) = 115
> hash(8) = 68
> hash(9) = 76
> hash(10) = 84
> hash(11) = 92
> hash(12) = 100
> hash(13) = 108
> hash(14) = 116
> hash(15) = 124
> hash(16) = 69
> hash(17) = 73
> hash(18) = 77
> hash(19) = 81
> hash(20) = 85
> hash(21) = 89
> hash(22) = 93
> hash(23) = 97
> hash(24) = 101
> hash(25) = 105
> hash(26) = 109
> hash(27) = 113
> hash(28) = 117
> hash(29) = 121
> hash(30) = 125
> hash(31) = 129
> hash(32) = 70
> hash(33) = 72
> hash(34) = 74
> hash(35) = 76
> hash(36) = 78
> hash(37) = 80
> hash(38) = 82
> hash(39) = 84
> hash(40) = 86
> hash(41) = 88
> hash(42) = 90
> hash(43) = 92
> hash(44) = 94
> hash(45) = 96
> hash(46) = 98
> hash(47) = 100
> hash(48) = 102
> hash(49) = 104
>
> Already in this range, there are several hash collisions:
>
> hash(9) = hash(35)
> hash(3) = hash(46)
> hash(10) = hash(39)
> hash(11) = hash(43)
> hash(12) = hash(47)
This is not a problem. It would be a problem if, for instance, 25% of your inputs produced the same hash code. Such cases lead to bad performance [1]. But a *few* inputs that map to the same hash code are not a problem, since the hash tables use a list structure for the entries with the same hash code. [2]
> Hash collisions should be extremely rare.
No. What one needs is a decent trade-off between the time to compute the hash code and the time to look up the entry in the table. You can create a hash function that is 5 times more expensive to compute and has fewer collisions, but that is no overall win when the entry lookup in the table is already fast. And in CLN it *is* fast.
> The issue is that small integers also yield small hashes. This is incorrect
> behaviour for a hash function, which should return values uniformly
> distributed in its output range, in this case [0, 2^32-1].
You are missing a step in your thinking:

1. The hash code function gets computed.
2. The hash code value is taken modulo the table size.
3. The entry or collision list is looked up.

There is no requirement that the output of step 1 be uniformly distributed across [0, 2^32-1]. It is the output of step 2 that should be uniformly distributed across [0, size-1], and the modulo operation provides a large part of this effect when the table size is relatively prime to 2^32.

Generally, for learning about hash codes and hash tables, I recommend Donald E. Knuth: The Art of Computer Programming, vol. 3: Sorting and Searching. He focuses more on open addressing than on the approach with a list structure, but the main principles are the same.
> Is there a reason for a custom implementation for this purpose, rather than
> a standard algorithm such as CRC32?
CRC32 is a particularly expensive algorithm: ca. 40%-50% of the CPU time of the 'gzip' program is spent computing CRCs. [3][4]

Additionally, the implementation of equal_hashcode() does not have arbitrary freedom. It must ensure that numbers of different types with the same value, such as 17, 17f0, 17e0, 17d0, 17L0, produce the same hash code.

Bruno

[1] https://www.haible.de/bruno/hashfunc.html
[2] cln/src/base/hash/cl_hashset.h, method 'get'
[3] https://lists.gnu.org/archive/html/bug-gnulib/2024-10/msg00071.html
[4] https://lists.gnu.org/archive/html/bug-gnulib/2024-10/msg00074.html