[PATCH] fix bug/feature causing non-deterministic output
Hi,

Please see the attached patch, which fixes a rather peculiar "feature" of GiNaC: non-determinism. One can think of many arguments why non-determinism in a CAS is a bad idea. Put briefly, it is simply unacceptable, especially when there is no good reason for it (and I can hardly think of a good reason anyway). At the moment GiNaC relies on the OS-assigned memory addresses of type names to define its term ordering, which results in different output from run to run of the same program.

When this bug was brought up on the mailing list (several times) the developers called it a "design decision". Well, it is time to stop the insanity. The attached patch fixes the bug at zero cost to performance. Here are the results of "/usr/bin/time make check" (second run, with pre-warmed caches):

unpatched, non-deterministic: 132.57 s
fixed, deterministic: 132.62 s

Keep in mind that this doesn't make GiNaC's canonical ordering entirely predictable, since it still depends on the order of symbol declaration and such (btw, fixing that wouldn't be too hard either). But at least for the same program and compiler the result will be identical.

PS: I added an FNV-1a hash; one could just keep using the crc32 which is already there. It makes no difference, since the hash function is now called only once per type (before, it was called for every object).

PPS: I'm attaching patches against master and release_1-6-2, since I hope the developers recognize this as a bug and update the stable releases as well.

With best regards,
Valery Yundin.
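[Editor's note: the FNV-1a hash mentioned in the postscript can be sketched as a small stand-alone function; this is the textbook 32-bit variant, not the exact patch code. Hashing the (stable) type name once per type gives a seed that is identical across runs, unlike a seed derived from the address returned by typeid.]

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV-1a with the standard reference parameters.
// Feeding it e.g. typeid(*this).name() yields a per-type seed that
// does not depend on where the runtime placed the type_info object.
inline std::uint32_t fnv1a(const std::string& s)
{
    std::uint32_t h = 2166136261u;      // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;                         // xor in one byte
        h *= 16777619u;                 // multiply by the FNV prime
    }
    return h;
}
```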
Hello, On Wed, May 21, 2014 at 06:17:44PM +0200, Valery Yundin wrote:
Please see attached patch which fixes rather peculiar "feature" of GiNaC - non-determinism.
There's nothing to fix here, move on. Non-deterministic term order is OK: GiNaC is designed for processing expressions consisting of (many) millions of terms (expressions of this size are very common in high energy physics). The intermediate expressions (which are typically the most sizable) are not going to be printed other than for debugging purposes. Also, the output of GiNaC is mostly processed by other programs rather than by human users. Last but not least, your patch is just plain buggy.
diff --git a/ginac/basic.cpp b/ginac/basic.cpp
index 4287f26..9f982b5 100644
--- a/ginac/basic.cpp
+++ b/ginac/basic.cpp
@@ -781,7 +781,8 @@ return_type_t basic::return_type_tinfo() const
  * would all end up with the same hashvalue. */
 unsigned basic::calchash() const
 {
-	unsigned v = make_hash_seed(typeid(*this));
+	static const unsigned seed = make_hash_seed(typeid(*this));
This is just plain wrong. The hash seed should be computed from the type information of the *current* object rather than from the type of the object which initiated the very first call to calchash(). The idea is that objects of different types are typically different, so they should have different hash values (even if their operands are the same). In theory one could use a lookup table in order not to recompute the hash seed for the same type twice (GiNaC takes a different approach: make the hash function fast enough that caching gives no advantage).
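[Editor's note: the lookup-table idea mentioned above could look roughly like the following; this is a hypothetical stand-alone sketch, not GiNaC code, and `hash_name` stands in for whatever string hash (crc32, FNV-1a) is actually used.]

```cpp
#include <cstdint>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>

// Stand-in for the real string hash (e.g. crc32 or FNV-1a).
inline std::uint32_t hash_name(const char* s)
{
    std::uint32_t h = 2166136261u;
    for (; *s; ++s) { h ^= static_cast<unsigned char>(*s); h *= 16777619u; }
    return h;
}

// Compute the per-type seed at most once, then serve it from a table,
// so calchash() never recomputes the name hash for a known type.
inline std::uint32_t seed_for(const std::type_info& ti)
{
    static std::unordered_map<std::type_index, std::uint32_t> cache;
    auto it = cache.find(std::type_index(ti));
    if (it == cache.end())
        it = cache.emplace(std::type_index(ti), hash_name(ti.name())).first;
    return it->second;
}
```

A real implementation would need locking to be thread-safe, and the table lookup itself is not free, which is exactly the trade-off Alexei alludes to: a fast enough hash makes caching pointless.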
+	unsigned v = seed;
 	for (size_t i=0; i<nops(); i++) {
In general any patch which makes the hash function substantially slower for nothing good at all will be rejected. If someone needs a predictable output, please write the corresponding print_context. Best regards, Alexei
Hi, Thanks for your comments, I reply inline below. On 24 May 2014 10:11, Alexei Sheplyakov wrote:
Non-deterministic term order is OK: GiNaC is designed for processing expressions consisting of (many) millions of terms (expressions of this size are very common in high energy physics). The intermediate expressions (which are typically the most sizable) are not going to be printed other than for debugging purposes. Also, the output of GiNaC is mostly processed by other programs rather than by human users.
I happen to know quite well what kind of expressions one can get in high energy physics. Determinism has nothing to do with humans reading the output, but with reproducibility of results. Reproducibility is an important property of any computational program; even Monte-Carlo simulations are deterministic. Being able to compare two 10 GB files of expressions with a simple diff or md5sum is rather convenient.
This is just plain wrong. The hash seed should be computed from the type information of the *current* object rather than from the type of the object which initiated the very first call to calchash(). The idea is that object having a different type are typically different so they should have different hash values (even if their operands are the same).
You are right that it isn't doing what was intended: any class which doesn't overload calchash will use the same seed, which is a mistake. It is of course arguable whether different classes must use different seeds, because the only drawback of a hash collision between different types is one extra typeid call in basic::compare. So ensuring a lack of collisions between different types might not really be that important, but it is also easy, so why not do it.
In theory one can use a lookup table in order to not recompute the hash seed for the same type twice (GiNaC uses a different approach - make the hash function fast enough so caching gives no advantages).
If that approach gave the same term order every run of a program everyone would be happy.
In general any patch which makes the hash function substantially slower for nothing good at all will be rejected.
I'm attaching a revised version of the patch; it adds a virtual function to every class derived from basic, which effectively caches the class-name hashes in local static variables. It is a rather small (if even noticeable) price to pay for greater convenience for users. I kept the old hash in basic::gethashseed() for any class that does not overload it, though it is best not to have such classes. If someone derives from a descendant of basic but doesn't overload gethashseed(), the two classes will share a hash seed, which will result in an extra call to typeid in basic::compare (doubtfully noticeable). make_hash_seed is moved to utils.h.
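[Editor's note: the shape of the revised patch can be sketched as follows; class names are simplified and hypothetical, and `hash_name` stands in for the real hash function. The point is the function-local static in the override, which makes the name hash a once-per-type cost.]

```cpp
#include <cstdint>
#include <typeinfo>

inline std::uint32_t hash_name(const char* s)   // stand-in string hash
{
    std::uint32_t h = 2166136261u;
    for (; *s; ++s) { h ^= static_cast<unsigned char>(*s); h *= 16777619u; }
    return h;
}

struct basic {
    virtual ~basic() = default;
    // Fallback for classes that do not override: they may end up
    // sharing a seed, costing only an extra typeid comparison when
    // two objects of different types happen to collide.
    virtual std::uint32_t gethashseed() const
    {
        return hash_name(typeid(*this).name());
    }
};

struct symbol : basic {
    std::uint32_t gethashseed() const override
    {
        // Computed once per type, on first use; identical across runs
        // because it depends only on the type name, not on addresses.
        static const std::uint32_t seed = hash_name(typeid(symbol).name());
        return seed;
    }
};
```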
If someone needs a predictable output, please write the corresponding print_context.
Yeah, it has always puzzled me why GiNaC doesn't have a sorted_print function but suggests users write their own. Best, Valery
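[Editor's note: lacking a built-in sorted_print, the usual user-side workaround is to print each term of a sum to its own string and sort the strings. The helper below is a minimal sketch of that idea; `sorted_sum` is a hypothetical name, not a GiNaC API.]

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Render the terms of a sum in lexicographic order, so the textual
// output no longer depends on the internal (hash-based) term order.
std::string sorted_sum(std::vector<std::string> terms)
{
    std::sort(terms.begin(), terms.end());
    std::ostringstream os;
    for (std::size_t i = 0; i < terms.size(); ++i)
        os << (i ? " + " : "") << terms[i];
    return os.str();
}
```

With GiNaC one would iterate over the operands of an add expression, print each operand into an ostringstream, and feed the resulting strings to such a helper before writing them out.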
On May 24, 2014, at 6:30 PM, Valery Yundin <yuvalery@gmail.com> wrote:
On 24 May 2014 10:11, Alexei Sheplyakov wrote:
Non-deterministic term order is OK: GiNaC is designed for processing expressions consisting of (many) millions of terms (expressions of this size are very common in high energy physics). The intermediate expressions (which are typically the most sizable) are not going to be printed other than for debugging purposes. Also, the output of GiNaC is mostly processed by other programs rather than by human users.
I happen to know quite well what kind of expressions one can get in high energy physics. Determinism has nothing to do with humans reading the output, but with reproducibility of results. Reproducibility is an important property of any computational program; even Monte-Carlo simulations are deterministic. Being able to compare two 10 GB files of expressions with a simple diff or md5sum is rather convenient.
Non-deterministic term order is most definitely not OK. Reproducibility of results is a requirement in many disciplines. Whenever a scientist or engineer changes compute platforms or shares an analysis with coworkers or customers, it’s essential that the analysis can be re-validated. Valery’s example of md5sum comparison is spot on. This argument isn’t hypothetical - it’s about whether GiNaC can be relied upon when mistakes could cost fortunes and/or lives. Ensuring deterministic results is the sort of design decision that distinguishes serious tools from cool toys. FWIW, reproducibility of results is also desirable to support software development. Alexei specifically alluded to its usefulness for debugging purposes. It’s obviously desirable for everything associated with a bug report to be reproducible. GiNaC is an impressive work, so I don’t mean to offend anyone. However, in this case it’s important that the GiNaC team come to appreciate that determinism is a real-world requirement, because denying real-world requirements only causes pain in the long run. Regards, -Mark
Hello, On Sun, May 25, 2014 at 12:30:54AM +0200, Valery Yundin wrote:
I happen to know what kind of expressions one can get in high energy physics quite well. Determinism has nothing to do with humans reading output, but with reproducibility of results.
The output of GiNaC is reproducible: the same calculation gives equal expressions, it's just the term order which might differ.
Reproducibility is an important property of any computational program, even Monte-Carlo simulations are deterministic. Being able to compare two 10G files of expressions by simple diff or md5sum/etc is rather convenient.
( echo 'expand(('; cat file1; echo ') - ('; cat file2; echo '));' ) | ginsh
You are right that it isn't doing what was intended: any class which doesn't overload calchash will use the same seed, which is a mistake. It is of course arguable whether different classes must use different seeds, because the only drawback of a hash collision between different types is one extra typeid call in basic::compare.
I'm terribly sorry, but everything which makes the hash function worse (slower or produces more collisions) is not acceptable.
I'm attaching a revised version of a patch, it adds virtual functions to every derived class of basic, which effectively cache the class name hashes in local static variables.
Unfortunately this version is even worse than the previous one.
--- a/ginac/basic.h
+++ b/ginac/basic.h
@@ -226,4 +226,5 @@ protected:
 	virtual unsigned calchash() const;
+	virtual unsigned gethashseed() const;
1) Now everyone has to copy-paste gethashseed() into their classes.
@@ -782,5 +781,5 @@ return_type_t basic::return_type_tinfo() const
 unsigned basic::calchash() const
 {
-	unsigned v = make_hash_seed(typeid(*this));
+	unsigned v = gethashseed();
 	for (size_t i=0; i<nops(); i++) {
 		v = rotate_left(v);
@@ -797,4 +796,11 @@ unsigned basic::calchash() const
2) Previously the hash seed was computed by a simple (inline) function. Now the calculation involves a call to a virtual method (which is *much* slower).
It is rather small (if noticeable) price to pay for greater convenience for users.
If you don't mind making your computations 10% slower you can just unconditionally #define GINAC_HASH_USE_MANGLED_NAME 1 in hash_seed.h (no, this change is not going to be applied to the official version of GiNaC).
Yeah, it always puzzled me why GiNaC doesn't have a sorted_print function but suggests users to write their own.
None of the developers happened to need such a function. Best regards, Alexei
Hi, On 29 May 2014 07:31, Alexei Sheplyakov <alexei.sheplyakov@gmail.com> wrote:
The output of GiNaC is reproducible: the same calculation gives equal expressions, it's just the term order which might differ.

A different term order may affect some algorithms and result in completely different expressions (for example, when solving an underdetermined system).
( echo 'expand(('; cat file1; echo ') - ('; cat file2; echo '));' ) | ginsh

To use this, users would need to write every GiNaC expression to a separate file?
I'm terribly sorry, but everything which makes the hash function worse (slower or produces more collisions) is not acceptable.

And I'm terribly sorry that you care about a single function more than about the whole library.
Unfortunately this version is even worse than the previous one. 1) Now everyone has to copy-paste gethashseed() into their classes.

They don't have to, only if they want deterministic ordering.
2) Previously the hash seed was computed by a simple (inline) function. Now the calculation involves a call to a virtual method (which is *much* slower).

Do you care about the hash function being abstractly fast, or about the speed of GiNaC as a whole? If the former, I have bad news for you: GiNaC has *hundreds* of virtual functions, all of which are "much slower". If you had ever run a profiler on GiNaC you'd know that hashing is nowhere near being a hotspot.
Did you happen to compare them on a real-world example and see a visible difference? Otherwise you are refusing a patch (which changes the user experience by adding determinism) using an argument based on the abstract speed of a single function (which does not affect the user experience in any visible way).
If you don't mind making your computations 10% slower you can just unconditionally

Again, where does this 10% come from? It is not even close to 1%.
Sincerely, Valery Yundin
participants (3):
- Alexei Sheplyakov
- Mark Rahner
- Valery Yundin