Dear Bruno and Richard,

I encountered an issue with the way CLN computes hashes of small integers. The following test program computes the hashes of the integers 0-49:
#include <iostream>
#include <cln/integer.h>

int main(int argc, char** args) {
    for (int i = 0; i < 50; i++)
        std::cout << "hash(" << i << ") = " << cln::equal_hashcode(cln::cl_I(i)) << std::endl;
    return 0;
}
The output is
hash(0) = 0
hash(1) = 65
hash(2) = 66
hash(3) = 98
hash(4) = 67
hash(5) = 83
hash(6) = 99
hash(7) = 115
hash(8) = 68
hash(9) = 76
hash(10) = 84
hash(11) = 92
hash(12) = 100
hash(13) = 108
hash(14) = 116
hash(15) = 124
hash(16) = 69
hash(17) = 73
hash(18) = 77
hash(19) = 81
hash(20) = 85
hash(21) = 89
hash(22) = 93
hash(23) = 97
hash(24) = 101
hash(25) = 105
hash(26) = 109
hash(27) = 113
hash(28) = 117
hash(29) = 121
hash(30) = 125
hash(31) = 129
hash(32) = 70
hash(33) = 72
hash(34) = 74
hash(35) = 76
hash(36) = 78
hash(37) = 80
hash(38) = 82
hash(39) = 84
hash(40) = 86
hash(41) = 88
hash(42) = 90
hash(43) = 92
hash(44) = 94
hash(45) = 96
hash(46) = 98
hash(47) = 100
hash(48) = 102
hash(49) = 104
Already in this range, there are several hash collisions:

hash(3) = hash(46)
hash(9) = hash(35)
hash(10) = hash(39)
hash(11) = hash(43)
hash(12) = hash(47)

Hash collisions should be extremely rare. The fact that they happen for such simple examples seems concerning to me. This is a problem for programs that compare hashes to check complicated expressions for equality, as expressions might appear equal despite having different integer coefficients inside them.

The issue is that small integers also yield small hashes. This is incorrect behaviour for a hash function, which should return values uniformly distributed in its output range, in this case [0, 2^32-1]. Larger integers sometimes give appropriate hashes, but not always, e.g.:

hash(1234) = 536871000
hash(2000) = 136

The problem also affects rational numbers, whose hashes are just the difference between the hashes of their numerators and denominators, e.g.:

hash(3/4) = hash(7/10)
hash(12/13) = hash(25/27)

I believe this is a bug in the computation of integer hashes. Is there a reason for a custom implementation for this purpose, rather than a standard algorithm such as CRC32?

Best regards
Florian
Hi Florian,
> Already in this range, there are several hash collisions:
>
> hash(9) = hash(35)
> hash(3) = hash(46)
> hash(10) = hash(39)
> hash(11) = hash(43)
> hash(12) = hash(47)
This is not a problem. It would be a problem if, for instance, 25% of your inputs produced the same hash code; such cases lead to bad performance [1]. But a *few* inputs that map to the same hash code are not a problem, since the hash tables use a list structure for the entries with the same hash code. [2]
> Hash collisions should be extremely rare.
No. What one needs is a decent trade-off between the time to compute the hash code and the time to do the entry lookup in the table. You can create a hash function that is 5 times more expensive to compute and has fewer collisions, but that is not a global win if the entry lookup in the table is fast anyway. And in CLN it *is* fast.
> The issue is that small integers also yield small hashes. This is incorrect
> behaviour for a hash function, which should return values uniformly
> distributed in its output range, in this case [0,2^32-1].
You are missing a step in your thinking:

1. The hashcode function gets computed.
2. The hash code value is taken modulo the table size.
3. The entry or collision list is looked up.

There is no requirement that the output of step 1 is uniformly distributed across [0,2^32-1]. It is the output of step 2 that should be uniformly distributed across [0,size-1], and the modulo operation provides a large part of this effect when the table size is relatively prime to 2^32.

Generally, for learning about hash codes and hash tables, I recommend Donald E. Knuth: The Art of Computer Programming, vol. 3: Sorting and Searching. He focuses more on open-coded hash tables than on the approach with a list structure, but the main principles are the same.
> Is there a reason for a custom implementation for this purpose, rather than
> a standard algorithm such as CRC32?
CRC32 is a particularly expensive algorithm: ca. 40%-50% of the CPU time of the 'gzip' program is spent computing CRCs. [3][4]

Additionally, the implementation of equal_hashcode() does not have arbitrary freedom. It must ensure that numbers of different types with the same value, such as 17, 17f0, 17e0, 17d0, 17L0, produce the same hash code.

Bruno

[1] https://www.haible.de/bruno/hashfunc.html
[2] cln/src/base/hash/cl_hashset.h, method 'get'
[3] https://lists.gnu.org/archive/html/bug-gnulib/2024-10/msg00071.html
[4] https://lists.gnu.org/archive/html/bug-gnulib/2024-10/msg00074.html
Dear Bruno,

thank you for the detailed explanation.

We were working on an algorithm that caches composite objects in GiNaC by their hash value. We will implement a system to handle hash collisions. In the long run this is safer anyway: had we not caught this issue at this stage, we might have run into some really hard-to-investigate bugs later on.

Best regards
Florian
Hi Florian,

The idea is indeed to specialize std::hash<GiNaC::ex> around ex::gethash(), which falls back to CLN's hashes for objects of numeric type. Is there anything wrong with this simple approach? Is it the size_t return value, which is 64 bit on modern systems, so that the 32 most significant bits remain unused? (How big do you expect your hash maps to get?)

Have a look at ginac/hash_map.h, remembering that std::unordered_map<> is basically a hash map!

All my best,
 -richard.

PS: CC'ing to the ginac-devel list.

--
Richard B. Kreckel <https://in.terlu.de/~kreckel/>

On 11/27/25 9:23 AM, Florian Lorkowski wrote:
> Dear Bruno,
>
> thank you for the detailed explanation.
>
> We were working on an algorithm that caches composite objects in GiNaC by
> their hash value. We will implement a system to handle hash collisions. In
> the long run this is safer anyway. Had we not caught this issue at this
> stage, we might have run into some really hard to investigate bugs later on.
>
> Best regards
> Florian
participants (3)

- Bruno Haible
- Florian Lorkowski
- Richard B. Kreckel