Dear Bruno and Richard,
I encountered an issue with the way CLN computes hashes of small
integers. The following test program computes the hashes of the integers
0-49:
> #include <iostream>
> #include <cln/integer.h>
>
> int main(int argc, char** args) {
> for(int i = 0; i < 50; i++)
> std::cout << "hash(" << i << ") = " <<
> cln::equal_hashcode(cln::cl_I(i)) << std::endl;
> return 0;
> }
The output is
> hash(0) = 0
> hash(1) = 65
> hash(2) = 66
> hash(3) = 98
> hash(4) = 67
> hash(5) = 83
> hash(6) = 99
> hash(7) = 115
> hash(8) = 68
> hash(9) = 76
> hash(10) = 84
> hash(11) = 92
> hash(12) = 100
> hash(13) = 108
> hash(14) = 116
> hash(15) = 124
> hash(16) = 69
> hash(17) = 73
> hash(18) = 77
> hash(19) = 81
> hash(20) = 85
> hash(21) = 89
> hash(22) = 93
> hash(23) = 97
> hash(24) = 101
> hash(25) = 105
> hash(26) = 109
> hash(27) = 113
> hash(28) = 117
> hash(29) = 121
> hash(30) = 125
> hash(31) = 129
> hash(32) = 70
> hash(33) = 72
> hash(34) = 74
> hash(35) = 76
> hash(36) = 78
> hash(37) = 80
> hash(38) = 82
> hash(39) = 84
> hash(40) = 86
> hash(41) = 88
> hash(42) = 90
> hash(43) = 92
> hash(44) = 94
> hash(45) = 96
> hash(46) = 98
> hash(47) = 100
> hash(48) = 102
> hash(49) = 104
Already in this range, there are several hash collisions:
hash(9) = hash(35)
hash(3) = hash(46)
hash(10) = hash(39)
hash(11) = hash(43)
hash(12) = hash(47)
Hash collisions should be extremely rare. The fact that they happen for
such simple examples seems concerning to me.
This is a problem for programs that compare hashes to check complicated
expressions for equality, as expressions might appear equal despite
having different integer coefficients inside them.
The issue is that small integers also yield small hashes. This is
incorrect behaviour for a hash function, which should return values
uniformly distributed in its output range, in this case [0,2^32-1].
Larger integers sometimes give appropriate hashes, but not always, e.g.:
hash(1234) = 536871000
hash(2000) = 136
The problem also affects rational numbers, whose hashes are just the
difference of the hashes between their numerators and denominators, e.g.:
hash(3/4) = hash(7/10)
hash(12/13) = hash(25/27)
I believe this is a bug in the computation of integer hashes.
Is there a reason to a custom implementation of this purpose, rather
than a standard algorithm such as CRC32?
Best regards
Florian