Possible memory leak? (was: update on fast mult benchmark (fwd))
Hi!

I have tried to run Richard Fateman's multiplication benchmark also with GiNaC (1.0.2, 1.0.6), but unsuccessfully, because the expand() method starts to eat large amounts of memory, so that I had to kill the process. Could this indicate a possible memory leak in GiNaC? Any other ideas? Here is my test program:

    #include <ginac/ginac.h>
    using namespace GiNaC;

    int main()
    {
        symbol x("x"), y("y"), z("z");
        ex r = (ex(power(1 + x + y + z, 20))).expand();
        ex r1 = r + 1;
        ex m = (r*r1).expand();
        return 0;
    }

Regards,
Pearu

---------- Forwarded message ----------
Date: Fri, 22 Mar 2002 16:01:01 -0800
From: Richard Fateman <fateman@cs.berkeley.edu>
Newsgroups: sci.math.symbolic
Subject: update on fast mult benchmark

I've added some more programs. It looks like this now.

    MockMMA                5.8 seconds (including 2.1 in GC)
    Macsyma (in ACL 6.1)   6.9 seconds (including 2.0 in GC)
    Hashing                8.6 seconds (including 2.2 in GC)
    Fermat                 9.2 seconds
    Macsyma (maxima5.5)   32.0 seconds (? GC not reported, second trial)
    Macsyma (commercial)  37.5 seconds (including 29.78 in GC)
    Macsyma (maxima5.5)   53.5 seconds (? GC not reported, first trial)
    Maple7                50.0 seconds (? GC, it was slower on 2nd trial)
    Mathematica 4.1       82.3 seconds (? it was slower on 2nd trial)
    MuPad 1.4.2 Pro      118.6 seconds

for multiplying the expanded version of (1+x+y+z)^20 by itself.

Details on http://www.cs.berkeley.edu/~fateman/papers/fastmult.pdf

Comments invited.
RJF
Hi, On Sat, 23 Mar 2002, Pearu Peterson wrote:
I have tried to run Richard Fateman's multiplication benchmark also with GiNaC (1.0.2, 1.0.6), but unsuccessfully, because the expand() method starts to eat large amounts of memory, so that I had to kill the process.
Could this indicate a possible memory leak in GiNaC? Any other ideas? Here is my test program:
    #include <ginac/ginac.h>
    using namespace GiNaC;

    int main()
    {
        symbol x("x"), y("y"), z("z");
        ex r = (ex(power(1 + x + y + z, 20))).expand();
        ex r1 = r + 1;
        ex m = (r*r1).expand();
        return 0;
    }
Incidentally, we have had this problem before. See the message <http://www.ginac.de/lists/ginac-devel/msg00062.html> and the following one. That has been solved then, but the test by Richard Fateman shows clearly that there are other problematic cases.

What happens? The above test runs when you give it one GigaByte. And there is no memory leak, because you can run it infinitely many times. If we look at GiNaC's memory footprint of the final result, it becomes obvious that its representation is quite competitive, too.

The culprit is mul::expand(), where two expanded polynomials (class add), one with n1 terms and another with n2 terms, are first written out by brute force in an n1*n2 vector. This is then returned through an add constructor which sorts it (calling expairseq::canonize()) and then compactifies it (calling expairseq::combine_same_terms_sorted_seq()). For two large input polynomials this is clearly bogus, because it does not allow for early cancellations. It was done this way because the complexity of sorting is asymptotically ideal (n*log(n) comparisons). This seems to have been a short-sighted decision.

I can think of a gazillion ways to repair this and will try to provide a fix really soon now. It needs some careful balancing, however. (Incidentally, I already had some trial code in my private tree which performs much better.)

To me it appears that this test per se does not tell us very much about representation. Besides questions of representation, the actual implementation of expand() might differ between the systems compared by Richard Fateman. In particular, I am amazed to see the difference between Maple, Mathematica and MuPAD being so tiny. This contradicts my own experience with handling polynomials in these three systems. The pattern t(MuPAD) > t(Maple) > t(Mma) looks reasonable, but the difference is usually *much* bigger.

Geez, and we thought polynomial expansion was trivial...

Regards
-richy.
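For illustration, here is a minimal univariate sketch of the strategy described above: write out all n1*n2 partial products, sort them, then merge runs of equal terms. The flat (exponent, coefficient) representation and the function names are made up for this sketch; this is not GiNaC's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// A term is (exponent, coefficient); a polynomial is a vector of terms.
using Term = std::pair<int, long>;
using Poly = std::vector<Term>;

// Brute-force product: all n1*n2 partial products are materialized
// first, then sorted (cf. expairseq::canonize()), then runs of equal
// exponents are merged (cf. combine_same_terms_sorted_seq()).
Poly expand_mul(const Poly& a, const Poly& b)
{
    Poly prod;
    prod.reserve(a.size() * b.size());   // the problematic n1*n2 buffer
    for (const Term& s : a)
        for (const Term& t : b)
            prod.emplace_back(s.first + t.first, s.second * t.second);

    std::sort(prod.begin(), prod.end()); // orders by exponent

    Poly result;
    for (const Term& t : prod) {
        if (!result.empty() && result.back().first == t.first)
            result.back().second += t.second;  // cancellation only here, at the end
        else
            result.push_back(t);
    }
    return result;
}
```

The reserve() line is exactly where the memory blow-up comes from: for two inputs of a few thousand terms each, millions of terms are held at once before a single pair is combined.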
PS: Richard: 1234567890 < (2^31)-1, on page 1, third paragraph. -- Richard B. Kreckel <Richard.Kreckel@Uni-Mainz.DE> <http://wwwthep.physik.uni-mainz.de/~kreckel/>
Richard B. Kreckel wrote:
Hi,
On Sat, 23 Mar 2002, Pearu Peterson wrote:
I have tried to run Richard Fateman's multiplication benchmark also with GiNaC (1.0.2,1.0.6)
<<<snip>>>
The culprit is mul::expand(), where two expanded polynomials (class add), one with n1 terms and another with n2 terms, are first written out by brute force in an n1*n2 vector. This is then returned through an add constructor which sorts it (calling expairseq::canonize()) and then compactifies it (calling expairseq::combine_same_terms_sorted_seq()).
This seems to me to be a design that should be revisited. For the univariate case it is particularly bad since multiplying polynomials of degree u and v gives you only degree u+v, and you will have allocated (u+1)*(v+1) cells.
For two large input polynomials this is clearly bogus, because it does not allow for early cancellations. It was done this way because the complexity of sorting is asymptotically ideal (n*log(n) comparisons). This seems to have been a short-sighted decision.
The sorting cost is going to be dominated by the coefficient operations most likely, even if it is n^2. Note that Maple doesn't bother to sort at all. I think that if you test the others against sorting of Maple's results, Maple will look far worse.
I can think of a gazillion ways how to repair this and will try to provide a fix really soon now. It needs some careful balancing, however. (Incidentally, I already had some trial-code in my private tree which performs much better.)
To me it appears that this test per se does not tell us very much about representation. Besides questions of representation, the actual implementation of expand() might differ between the systems compared by Richard Fateman.
There are not too many ways of multiplying sparse polynomials without using u*v multiplies and about u*v - size(answer) adds. For dense polynomials in a finite field, FFT can be used. My guess is that an FFT approach would not be faster on such a small ( :) ) problem as this one. But a careful implementation might win. In order to represent the inputs and the outputs, the FFT would probably have to be done quite a few times, since the bignum coefficients would need to be built up by Chinese Remainder.

In particular, I am amazed to see the difference between Maple, Mathematica and MuPAD being so tiny. This contradicts my own experience with handling polynomials in these three systems. The pattern t(MuPAD) > t(Maple) > t(Mma) looks reasonable, but the difference is usually *much* bigger.
Geez, and we thought polynomial expansion was trivial...
Incidentally, you should be able to do w^2 much faster than w*w. RJF
Regards -richy.
PS: Richard: 1234567890 < (2^31)-1, on page 1, third paragraph.
Thanks. I changed it. That number is large enough so that in my Lisp it is not a single-word "fixnum"... some bits are used for type tags. I don't know anything about GiNaC internals, but such design decisions involving a unified treatment of arbitrary-size integers can affect time, programmer convenience, data size, etc. in a very complicated way. When this problem works on GiNaC I'd like to know its speed! RJF
Hi, On Sun, 24 Mar 2002, Richard Fateman wrote:
For two large input polynomials this is clearly bogus, because it does not allow for early cancellations. It was done this way because the complexity of sorting is asymptotically ideal (n*log(n) comparisons). This seems to have been a short-sighted decision.
The sorting cost is going to be dominated by the coefficient operations most likely, even if it is n^2. Note that Maple doesn't bother to sort at all. I think that if you test the others against sorting of Maple's results, Maple will look far worse.
Sure Maple bothers to sort! The order may not be apparent, but they do sort their expressions. Hasn't that been the consensus when the discussion last came up on sci.math.symbolic? By the way: This observation of yours that Maple gets slower on the 2nd trial, isn't this exactly the problem Carl DeVore was describing in sci.math.symbolic recently? It seems to get even slower after more trials. Is this a new "feature" or do older releases of Maple also suffer from this? [...]
There are not too many ways of multiplying sparse polynomials without using u*v multiplies and about u*v -size(answer) adds.
I was referring more to implementation details than to fundamental algorithmic things. Hence the gazillions. :-)
For dense polynomials in a finite field, FFT can be used. My guess is that an FFT approach would not be faster on such a small ( :) ) problem as this one. But a careful implementation might win. In order to represent the inputs and the outputs, the FFT would probably have to be done quite a few times since the bignum coefficients would need to be built up by Chinese Remainder.
This is exactly what I was thinking when Dan Bernstein recommended the FFT-based method for univariate polynomials when I last met him. I don't know if anything came out of his Zmult project, though... [...]
Geez, and we thought polynomial expansion was trivial...
Incidentally, you should be able to do w^2 much faster than w*w.
Sure in the sense that (w*w).expand() is much faster than (w.expand()*w.expand()).expand(). Oh, and w*w equals pow(w,2) trivially due to evaluation.
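In a term-list representation, the saving when squaring comes from symmetry: each cross product a_i*a_j with i != j occurs twice in the square, so it can be computed once and doubled, leaving n*(n+1)/2 coefficient multiplies instead of n*n. A hypothetical univariate sketch (illustrative only, not from GiNaC):

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

using Term = std::pair<int, long>;
using Poly = std::vector<Term>;

// Square an n-term polynomial with n*(n+1)/2 multiplies: diagonal
// terms a_i^2 once each, off-diagonal pairs computed once and doubled.
Poly square(const Poly& a)
{
    std::unordered_map<int, long> acc;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc[2 * a[i].first] += a[i].second * a[i].second;          // diagonal
        for (std::size_t j = i + 1; j < a.size(); ++j)
            acc[a[i].first + a[j].first] += 2 * a[i].second * a[j].second;
    }
    Poly result;
    for (const auto& [e, c] : acc)
        if (c != 0)
            result.emplace_back(e, c);
    return result;
}
```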
PS: Richard: 1234567890 < (2^31)-1, on page 1, third paragraph.
Thanks. I changed it. That number is large enough so that in my Lisp it is not a single-word "fixnum" .. some bits are used for type tags. I don't know anything about GiNaC internals, but such design decisions involving a unified treatment of arbitrary size integers can affect time, programmer convenience, data size etc. in a very complicated way.
Very true. CLN does exactly this: it distinguishes between fast fixnums and slow bignums in a transparent way using type tags. Must be because another Lisp-fan wrote it... :-) Regards -richy. -- Richard B. Kreckel <Richard.Kreckel@Uni-Mainz.DE> <http://wwwthep.physik.uni-mainz.de/~kreckel/>
Richard B. Kreckel wrote:
Sure Maple bothers to sort! The order may not be apparent, but they do sort their expressions. Hasn't that been the consensus when the discussion last came up on sci.math.symbolic?
I don't call what they do sorting, since they do not impose an external order on expressions: it is merely collecting of terms. This can be done by a hash table in (probably) O(n) time. An answer which does not allow you to tell the degree of a polynomial without looking through all the terms is less useful. A test which requires division would probably show this.
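A sketch of the hash-based collection being described, univariate and hypothetical (not Maple's actual code): partial products go straight into a table keyed on the exponent, so equal terms combine, and cancel, the moment they are produced, but the result carries no order, so the degree is not visible without a full scan.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

using Term = std::pair<int, long>;
using Poly = std::vector<Term>;

// Expected O(n1*n2) total work; terms are combined as soon as they are
// produced, so no n1*n2 intermediate buffer is ever held.
Poly mul_hashed(const Poly& a, const Poly& b)
{
    std::unordered_map<int, long> acc;
    for (const Term& s : a)
        for (const Term& t : b)
            acc[s.first + t.first] += s.second * t.second;

    Poly result;
    for (const auto& [e, c] : acc)
        if (c != 0)                  // early cancellation falls out for free
            result.emplace_back(e, c);
    return result;                   // unordered: degree needs a full scan
}
```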
By the way: This observation of yours that Maple gets slower on the 2nd trial, isn't this exactly the problem Carl DeVore was describing in sci.math.symbolic recently? It seems to get even slower after more trials. Is this a new "feature" or do older releases of Maple also suffer from this?
I have no old copies "alive".
[...]
There are not too many ways of multiplying sparse polynomials without using u*v multiplies and about u*v -size(answer) adds.
I was referring more to implementation details than to fundamental algorithmic things. Hence the gazillions. :-)
For dense polynomials in a finite field, FFT can be used. My guess is that an FFT approach would not be faster on such a small ( :) ) problem as this one. But a careful implementation might win. In order to represent the inputs and the outputs, the FFT would probably have to be done quite a few times since the bignum coefficients would need to be built up by Chinese Remainder.
This is exactly what I was thinking when Dan Bernstein recommended the FFT-based method for univariate polynomials when I last met him. I don't know if anything came out of his Zmult project, though...
It would be possible to encode the question and the answer as a huge multiplication of 2 integers. The loss in encoding and decoding would be either very wasteful or very tricky. <snip> RJF
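That encoding is commonly called Kronecker substitution: evaluate both polynomials at x = B for a base B large enough that no coefficient of the product reaches B, multiply the two resulting integers, and read the product's coefficients back off as base-B digits. A toy sketch with 64-bit arithmetic and small non-negative coefficients; real uses need bignums, which is where the waste or trickery predicted above comes in:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Coefficient vector: p[i] is the coefficient of x^i.
using Poly = std::vector<std::uint64_t>;

const std::uint64_t B = 1000;  // base; product coefficients must stay < B

// Evaluate p at x = B, packing the coefficients into one integer.
std::uint64_t encode(const Poly& p)
{
    std::uint64_t v = 0;
    for (auto it = p.rbegin(); it != p.rend(); ++it)
        v = v * B + *it;
    return v;
}

// One integer multiplication does the whole polynomial product;
// the digits of the result in base B are the product's coefficients.
Poly kronecker_mul(const Poly& a, const Poly& b)
{
    std::uint64_t n = encode(a) * encode(b);
    Poly result;
    while (n != 0) {
        result.push_back(n % B);
        n /= B;
    }
    return result;
}
```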
Hi, On Sun, 24 Mar 2002, Richard Fateman wrote:
This seems to me to be a design that should be revisited. For the univariate case it is particularly bad since multiplying polynomials of degree u and v gives you only degree u+v, and you will have allocated (u+1)*(v+1) cells.
Sure. Done now.
When this problem works on GiNaC I'd like to know its speed!
Here are timings on a P-III, 1GHz, 512kB cache, running Linux, with approximate memory usage:

    System                                              time     Memory
    -------------------------------------------------+--------+---------
    GiNaC-1.0.8 (from CVS), using GCC-3.1 prerelease:   57.3s      8MB
    MuPAD 1.4.2:                                       119.2s     25MB
    MuPAD 2.0:                                         117.0s     30MB
    Form 3.0:                                            0.5s      6MB
    Maxima 5.6 (GCL 2.4.0):                             51.0s      8MB
    MapleVR3:                                           16.7s      7MB
    MapleVR4:                                           17.9s      8MB
    MapleVR5:                                           18.4s      8MB
    Mathematica 4.1:                                    80.5s     11MB
    Reduce 3.6:                                 Segmentation Fault
    Yacas 1.0.48:                 N/A (smaller examples fail to expand fully)

Richard, I noticed a mistake in the printed version of the Mathematica program (missing Expand[]), and the MockMMA doesn't seem right to me either.

Note that Form can really only do ring operations like in this test, but its speed there should come as no surprise. You probably just cannot beat it with any more general system.

Regards
-richy.
--
Richard B. Kreckel <Richard.Kreckel@Uni-Mainz.DE> <http://wwwthep.physik.uni-mainz.de/~kreckel/>
Thanks for this data. Your timings look like mine for the ones that overlap. The difference between 1GHz and 933MHz is not that great, and I guess the other parameters are close too. It is odd that Maple 7 is so much slower than Maple V R4 etc. Maybe the change to longer forms slowed it down.

I tried downloading Form 3.0, and I find it mysterious. Maybe I am not using it correctly. My first version took 101.23 seconds. I repeated it and it took 97 seconds. This is what I did:

    C:\form>form sample3.frm
    FORM by J.Vermaseren,version 3.0(Jan 28 2001)
    Run at: Thu Mar 28 10:26:41 2002

    Symbols x,y,z;
    Local q=(1+x+y+z)^20;
    Local q1=q+1;
    Local res=q*q1;
    .end
    .....

This changed program, however, took only 0.98 seconds:

    C:\form>form sample7.frm
    FORM by J.Vermaseren,version 3.0(Jan 28 2001)
    Run at: Thu Mar 28 10:50:53 2002

    Symbols x,y,z;
    Local q2=(1+x+y+z)^20*(1+(1+x+y+z)^20);
    .end

    Time = 1.04 sec    Generated terms = 14112
    q2                 Terms in output = 12341
                       Bytes used = 305378

Richard B. Kreckel wrote:
Hi,
On Sun, 24 Mar 2002, Richard Fateman wrote:
This seems to me to be a design that should be revisited. For the univariate case it is particularly bad since multiplying polynomials of degree u and v gives you only degree u+v, and you will have allocated (u+1)*(v+1) cells.
Sure. Done now.
When this problem works on GiNaC I'd like to know its speed!
Here are timings on a P-III, 1GHz, 512kB cache, running Linux, with approximate memory usage:
    System                                              time     Memory
    -------------------------------------------------+--------+---------
    GiNaC-1.0.8 (from CVS), using GCC-3.1 prerelease:   57.3s      8MB
    MuPAD 1.4.2:                                       119.2s     25MB
    MuPAD 2.0:                                         117.0s     30MB
    Form 3.0:                                            0.5s      6MB
    Maxima 5.6 (GCL 2.4.0):                             51.0s      8MB
    MapleVR3:                                           16.7s      7MB
    MapleVR4:                                           17.9s      8MB
    MapleVR5:                                           18.4s      8MB
    Mathematica 4.1:                                    80.5s     11MB
    Reduce 3.6:                                 Segmentation Fault
    Yacas 1.0.48:                 N/A (smaller examples fail to expand fully)
Richard, I noticed a mistake in the printed version of the Mathematica program (missing Expand[]) and the MockMMA doesn't seem right to me either. Note that Form can really only do ring operations like in this test, but its speed there should come as no surprise. You probably just cannot beat it with any more general system.
Regards -richy.
I tried downloading Form 3.0, and I find it mysterious. Maybe I am not using it correctly. My first version took 101.23 seconds. I repeated it and it took 97 seconds.
This is what I did:
    C:\form>form sample3.frm
    FORM by J.Vermaseren,version 3.0(Jan 28 2001)
    Run at: Thu Mar 28 10:26:41 2002

    Symbols x,y,z;
    Local q=(1+x+y+z)^20;
    Local q1=q+1;
    Local res=q*q1;
    .end
    .....

This changed program, however, took only 0.98 seconds:
    C:\form>form sample7.frm
    FORM by J.Vermaseren,version 3.0(Jan 28 2001)
    Run at: Thu Mar 28 10:50:53 2002

    Symbols x,y,z;
    Local q2=(1+x+y+z)^20*(1+(1+x+y+z)^20);
    .end
    Time = 1.04 sec    Generated terms = 14112
    q2                 Terms in output = 12341
                       Bytes used = 305378
The reason might be that the first power is expanded as (1+x+y+z)*(1+x+y+z)*...*(1+(1+x+y+z)*...*(1+x+y+z)), which will expand much faster than forming the product of the two expanded powers of 1+x+y+z. This would be the same reason why it is never a good idea to use the Chinese powering algorithm for the expansion of powers of multivariate polynomials.
Hi,

I just discovered Form for myself. It is just too amazing how efficient and powerful, and yet so inefficient and limited, this program is (if misused, in the latter case, I guess). Here follows another example of misusing the Form program. Just out of curiosity I tried to evaluate q/q where q=(1+x+y+z)^20. Any (?) other CAS would result in 1, but not Form:

    -------------------------------------------
    FORM by J.Vermaseren,version 3.0(Oct 31 2001)
    Run at: Thu Mar 28 21:30:41 2002

    Symbols x,y,z;
    Local q=(1+x+y+z)^20;
    Local res=q/q;
    .end

    Time = 0.06 sec    Generated terms = 1771
    q                  Terms in output = 1771
                       Bytes used = 31782
    Sorted function argument too long.
    Called from InFunction
    ----------------------------------------------

Using a smaller exponent shows what is going on:

    -----------------------------------------------
    FORM by J.Vermaseren,version 3.0(Oct 31 2001)
    Run at: Thu Mar 28 21:35:35 2002

    Symbols x,y,z;
    Local q=(1+x+y+z);
    Local res=q/q;
    print;
    .end

    Time = 0.00 sec    Generated terms = 4
    q                  Terms in output = 4
                       Bytes used = 54
    Time = 0.00 sec    Generated terms = 4
    res                Terms in output = 4
                       Bytes used = 122

    q = 1 + x + y + z;

    res = 1/(1 + x + y + z) + 1/(1 + x + y + z)*x + 1/(1 + x + y + z)*y
          + 1/(1 + x + y + z)*z;
    --------------------------------------------

Regards,
Pearu
Hi, On Thu, 28 Mar 2002, Richard Fateman wrote:
    C:\form>form sample3.frm
    FORM by J.Vermaseren,version 3.0(Jan 28 2001)
    Run at: Thu Mar 28 10:26:41 2002

    Symbols x,y,z;
    Local q=(1+x+y+z)^20;
    Local q1=q+1;
    Local res=q*q1;
    .end
    .....
Note that using .sort makes the above example approx. 4 times faster:

    Symbols x,y,z;
    Local q=(1+x+y+z)^20;
    .sort
    Local q1=1+q;
    .sort
    Local res=q1*q;
    .end

Regards,
Pearu
It is not necessary to have two .sorts in there. I took out the first one, and it was just as fast. Why should this help?

I tried this:

    Symbols x,y,z;
    Local q=(1+x+y+z)^20*(1+(1+z+x+y)^20)-(1+x+z+y)^20*(1+(1+z+x+y)^20);
    Print;
    .end

and it took about twice as long as the longest time (190 seconds).

I also tried

    Local q=(1+x+y+z)^40

and that takes about 0.98 seconds.

    Local q=(1+x+y+z)^40 - (1+x+z+y)^40

and that takes about 1.7 seconds. What is going on?

Pearu Peterson wrote:
Hi,
On Thu, 28 Mar 2002, Richard Fateman wrote:
    C:\form>form sample3.frm
    FORM by J.Vermaseren,version 3.0(Jan 28 2001)
    Run at: Thu Mar 28 10:26:41 2002

    Symbols x,y,z;
    Local q=(1+x+y+z)^20;
    Local q1=q+1;
    Local res=q*q1;
    .end
    .....
Note that using .sort makes the above example approx. 4 times faster:
    Symbols x,y,z;
    Local q=(1+x+y+z)^20;
    .sort
    Local q1=1+q;
    .sort
    Local res=q1*q;
    .end
Regards, Pearu
Richard Fateman wrote:
It is not necessary to have two .sorts in there. I took out the first one, and it was just as fast. Why should this help?
I tried this:
    Symbols x,y,z;
    Local q=(1+x+y+z)^20*(1+(1+z+x+y)^20)-(1+x+z+y)^20*(1+(1+z+x+y)^20);
    Print;
    .end
and it took about twice as long as the longest time. (190 seconds)
I also tried

    Local q=(1+x+y+z)^40

and that takes about 0.98 seconds.

    Local q=(1+x+y+z)^40 - (1+x+z+y)^40

and that takes about 1.7 seconds.
What is going on?
Taking the power is much faster simply because you do *not* apply the Chinese powering algorithm, as a simple cost analysis shows for any multivariate polynomial. Just run the multiplication loop 40 times.
Hi, On Fri, 29 Mar 2002, parisse wrote:
Richard Fateman wrote:
It is not necessary to have two .sorts in there. I took out the first one, and it was just as fast. Why should this help?
I tried this:
    Symbols x,y,z;
    Local q=(1+x+y+z)^20*(1+(1+z+x+y)^20)-(1+x+z+y)^20*(1+(1+z+x+y)^20);
    Print;
    .end
and it took about twice as long as the longest time. (190 seconds)
I also tried

    Local q=(1+x+y+z)^40

and that takes about 0.98 seconds.

    Local q=(1+x+y+z)^40 - (1+x+z+y)^40

and that takes about 1.7 seconds.
What is going on?
Taking the power is much faster simply because you do *not* apply the Chinese powering algorithm, as a simple cost analysis shows for any multivariate polynomial. Just run the multiplication loop 40 times.
Honestly, I've never heard the term "Chinese powering". I assume you mean fast exponentiation (by successively going through the binary representation of the integer exponent)? Sure, the multiplication loops are much faster. I guess this is what Knuth means in volume 2, in his section "Evaluation of Powers". BTW, even faster, you can directly build up the result with the proper coefficients in place (as GiNaC does in power::expand_add(), slightly fixed in CVS, too).

Errhh, D. Knuth gives a citation of a review article in that section which sounds interesting to me. Since it is quite hard to get, could that be made available online by its author, or was it prior to electronic preparation of manuscripts? ;-)

However, I am not sure how that can be the reason why Form is so slow in the second case. The difference for exponent 20 seems to be a factor of ten, based on my own tests. Certainly not a factor of 100 or more...

Regards
-richy.
--
Richard B. Kreckel <Richard.Kreckel@Uni-Mainz.DE> <http://wwwthep.physik.uni-mainz.de/~kreckel/>
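A small sanity check on such cost arguments: the dense expansion of (x_1 + ... + x_k)^n has C(n+k-1, k-1) terms. For (1+x+y+z)^20 that is C(23,3) = 1771, matching the "Generated terms = 1771" Form reports earlier in this thread, and for exponent 40 it is C(43,3) = 12341, matching the "Terms in output = 12341" in the q2 run. A one-liner to compute it (illustrative, not from any of the systems discussed):

```cpp
#include <cassert>

// Number of monomials of total degree n in k variables, i.e. terms in
// the dense expansion of (x_1 + ... + x_k)^n: C(n+k-1, k-1).
long long dense_term_count(int n, int k)
{
    long long c = 1;
    for (int i = 1; i < k; ++i)
        c = c * (n + i) / i;   // each step yields C(n+i, i), so the division is exact
    return c;
}
```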
My guess is that in Form, when you write Local q=(1+x+y+z)^40, what is produced is some internal encoded form that is inadequate for addition of polynomials, but is somehow adequate to know how many terms there are. Form, in its devotion to handling large expressions, is not so efficient at adding this expression, which has "only" 13,000 terms and can be handled by other computer algebra systems. (Like a page-oriented sort.) But this may be entirely wrong.

If the review article is something I wrote (I wrote 2 papers on polynomial powering: sparse and dense), I'll try to get it online.

RJF

"Richard B. Kreckel" wrote:
Hi,
On Fri, 29 Mar 2002, parisse wrote:
Richard Fateman wrote:
It is not necessary to have two .sorts in there. I took out the first one, and it was just as fast. Why should this help?
I tried this:
    Symbols x,y,z;
    Local q=(1+x+y+z)^20*(1+(1+z+x+y)^20)-(1+x+z+y)^20*(1+(1+z+x+y)^20);
    Print;
    .end
and it took about twice as long as the longest time. (190 seconds)
I also tried

    Local q=(1+x+y+z)^40

and that takes about 0.98 seconds.

    Local q=(1+x+y+z)^40 - (1+x+z+y)^40

and that takes about 1.7 seconds.
What is going on?
Taking the power is much faster simply because you do *not* apply the Chinese powering algorithm, as a simple cost analysis shows for any multivariate polynomial. Just run the multiplication loop 40 times.
Honestly, I've never heard the term "Chinese powering". I assume you mean fast exponentiation (by successively going through the binary representation of the integer exponent)? Sure, the multiplication loops are much faster. I guess this is what Knuth means in volume 2, in his section "Evaluation of Powers". BTW, even faster, you can directly build up the result with the proper coefficients in place (as GiNaC does in power::expand_add(), slightly fixed in CVS, too).
Errhh, D. Knuth gives a citation of a review article in that section which sounds interesting to me. Since it is quite hard to get, could that be made available online by its author or was it prior to electronic preparation of manuscripts? ;-)
However, I am not sure how that can be the reason why Form is so slow in the second case. The difference for exponent 20 seems to be a factor of ten, based on my own tests. Certainly not a factor of 100 or more...
Regards -richy. -- Richard B. Kreckel <Richard.Kreckel@Uni-Mainz.DE> <http://wwwthep.physik.uni-mainz.de/~kreckel/>
participants (4)
- parisse
- Pearu Peterson
- Richard B. Kreckel
- Richard Fateman