Hi Florian,
> I encountered an issue with the way CLN computes hashes of small integers.
> The following test program computes the hashes of the integers 0-49:
>
> #include <iostream>
> #include <cln/integer.h>
>
> int main() {
>     for (int i = 0; i < 50; i++)
>         std::cout << "hash(" << i << ") = "
>                   << cln::equal_hashcode(cln::cl_I(i)) << std::endl;
>     return 0;
> }
> The output is
>
> hash(0) = 0
> hash(1) = 65
> hash(2) = 66
> hash(3) = 98
> hash(4) = 67
> hash(5) = 83
> hash(6) = 99
> hash(7) = 115
> hash(8) = 68
> hash(9) = 76
> hash(10) = 84
> hash(11) = 92
> hash(12) = 100
> hash(13) = 108
> hash(14) = 116
> hash(15) = 124
> hash(16) = 69
> hash(17) = 73
> hash(18) = 77
> hash(19) = 81
> hash(20) = 85
> hash(21) = 89
> hash(22) = 93
> hash(23) = 97
> hash(24) = 101
> hash(25) = 105
> hash(26) = 109
> hash(27) = 113
> hash(28) = 117
> hash(29) = 121
> hash(30) = 125
> hash(31) = 129
> hash(32) = 70
> hash(33) = 72
> hash(34) = 74
> hash(35) = 76
> hash(36) = 78
> hash(37) = 80
> hash(38) = 82
> hash(39) = 84
> hash(40) = 86
> hash(41) = 88
> hash(42) = 90
> hash(43) = 92
> hash(44) = 94
> hash(45) = 96
> hash(46) = 98
> hash(47) = 100
> hash(48) = 102
> hash(49) = 104
>
> Already in this range, there are several hash collisions:
>
> hash(9) = hash(35)
> hash(3) = hash(46)
> hash(10) = hash(39)
> hash(11) = hash(43)
> hash(12) = hash(47)
This is not a problem. It would be a problem if, for instance, 25% of your inputs produced the same hash code. Such cases lead to bad performance [1]. But a *few* inputs that map to the same hash code are not a problem, since the hash tables use a list structure for the entries with the same hash code. [2]
> Hash collisions should be extremely rare.
No. What one needs is a decent trade-off between the time to compute the hash code and the time to look up the entry in the table. You can create a hash function that is 5 times more expensive to compute and has fewer collisions, but that is no overall win when the entry lookup in the table is already fast. And in CLN it *is* fast.
> The issue is that small integers also yield small hashes. This is incorrect
> behaviour for a hash function, which should return values uniformly
> distributed in its output range, in this case [0, 2^32-1].
You are missing a step in your thinking:

1. The hash code function gets computed.
2. The hash code value is taken modulo the table size.
3. The entry or collision list is looked up.

There is no requirement that the output of step 1 be uniformly distributed across [0, 2^32-1]. It is the output of step 2 that should be uniformly distributed across [0, size-1], and the modulo operation provides a large part of this effect when the table size is relatively prime to 2^32.

Generally, for learning about hash codes and hash tables, I recommend Donald E. Knuth: The Art of Computer Programming, vol. 3: Sorting and Searching. He focuses more on open addressing than on the approach with a list structure, but the main principles are the same.
> Is there a reason for a custom implementation for this purpose, rather than
> a standard algorithm such as CRC32?
CRC32 is a particularly expensive algorithm: ca. 40%-50% of the CPU time of the 'gzip' program is spent computing CRCs. [3][4]

Additionally, the implementation of equal_hashcode() does not have arbitrary freedom. It must ensure that numbers of different types with the same value, such as 17, 17f0, 17e0, 17d0, 17L0, produce the same hash code.

Bruno

[1] https://www.haible.de/bruno/hashfunc.html
[2] cln/src/base/hash/cl_hashset.h, method 'get'
[3] https://lists.gnu.org/archive/html/bug-gnulib/2024-10/msg00071.html
[4] https://lists.gnu.org/archive/html/bug-gnulib/2024-10/msg00074.html