Bugzilla – Bug 8528

pgamma - inadequate algorithm design and poor coding

Last modified: 2011-10-22 07:46:59 UTC

From: Prof Brian Ripley <ripley@stats.ox.ac.uk> R versions 2.1.0 to present. Examples shown were computed under Windows R-devel, current SVN, but ix86 Linux shows similar behaviour (sometimes NaN or -Inf rather than Inf, depending on the compiler and optimization level used). The replacement pgamma algorithm used from R 2.1.0 has an inadequate design and no supporting documentation whatsoever. There is no reference given to support the algorithm, and it seems very desirable to use only algorithms with a published (and preferably refereed) analysis, or at least of impeccable provenance. The following errors were found by investigating an example in the d-p-q-r-tests.R regression tests that gave NaN on a real system. These errors were not present in R 2.0.0, which used a normal approximation in that region. We could fix this by reverting where needed to a normal approximation, but that leaves the problem that we have no proof of the validity or accuracy of the rest of the algorithm (if indeed it is accurate). ?pgamma says As from R 2.1.0 'pgamma()' uses a new algorithm (mainly by Morten Welinder) which should be uniformly as accurate as AS 239. Well, it 'should be' but it is not, and we should not be making statements like that. Those in the email quoted in pgamma.c exhibit hubris. There are also at least two examples of sloppy coding that lead to numeric overflow and complete loss of accuracy. Consider > pgamma(seq(0.75, 1.25, by=0.05)*1e100, shape = 1e100, log=T) [1] -3.768207e+98 NaN NaN NaN NaN [6] -6.931472e-01 NaN NaN NaN NaN [11] 0.000000e+00 Warning message: NaNs produced in: pgamma(q, shape, scale, lower.tail, log.p) > pgamma(seq(0.75, 1.25, by=0.05)*1e100, shape = 1e100, log=T, lower=F) [1] 0.000000e+00 NaN NaN NaN NaN [6] -6.931472e-01 Inf Inf Inf Inf [11] -2.685645e+98 Warning message: NaNs produced in: pgamma(q, shape, scale, lower.tail, log.p) > pgamma(c(1-1e-10, 1+1e-10)*1e100, shape = 1e100) [1] NaN NaN (shape=1e25 is enough to cause a breakdown in the first of these, and 1e60 in the rest.) The code has four branches 1) x <= 1 2) x <= alph - 1 && x < 0.8 * (alph + 50)). This has the comment /* incl. large alph */, but that is false. 3) if (alph - 1 < x && alph < 0.8 * (x + 50)). This has the comment /* incl. large x */, but again false. 4) The rest, which uses an asymptotic expansion in pt_ = -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) = -x * log((lambda-x)/x) - (lambda-x) and naively assumes that this is small enough to use a power series expansion in 1/x with coefficients as powers of pt_. To make matters worse, consider > pgamma(0.9*1e25, 1e25, log=T) pgamma_raw(x=9e+024, alph=1e+025, low=1, log=1) using ppois_asymp() pt_ = 5.3605156578263e+022 pp*_asymp(): f=-2.0803930924076e-013 np=-5.3605156578263e+022 nd=-5.3605156578263e+022 f*nd=11151979746.284 [1] -Inf Hmm, how did that manage to lose *all* the accuracy here? Hubris again appears in the comments. Here np and nd are on log scale and if they are large they will be almost equal (and negative), and f is not large (and if it were we could have computed log f). So we can compute the log of np+f*nd accurately as log(np*(1+f*nd/np)) = lnp + log(1+f*nd/np) = lnp + log1p(f*exp(lnd-lnp)) Almost all the mass of gamma(shape=1e100) is concentrated at the nearest representable value to 1e100: > qgamma(c(1e-16, 1-1e-16), 1e100)-1e100 [1] 0 0 (if it can be believed, but this can be verified independently). So being accurate in the middle of the range is pretty academic, but one can at least avoid returning the nonsense of non-monotone cdfs and NaN/infinite probabilities. -- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595

NOTES: partial fix in 2.2.1 patched. Needs documentation for e.g. the accuracy claims

Audit (from Jitterbug): Fri Jan 27 15:31:14 2006 ripley changed notes Fri Jan 27 14:31:14 2006 ripley moved from incoming to Accuracy Fri Jan 27 16:37:18 2006 ripley changed notes

From: IandJMSmith@aol.com On 23/02/05 I suggested that given R had included TOMS 708 to correct for the poor performance of pbeta, TOMS 654 should be included to fix all the pgamma problems. I was slightly surprised to find Morten's code had been included instead 2 days later. I noticed but did not worry that the reference to me had been removed. The derivation of the asymptotic expansion for the gamma distribution used by Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm It is fairly easy to understand and find error bounds for and hence include sensibly in an algorithm to calculate pgamma. The basis and accuracy of the some of the algorithms I use is discussed in http://members.aol.com/iandjmsmith/Accuracy.htm In this case, the absolute error in the log of the probability gives a good indication of the accuracy of your answer. In the least extreme example you consider (pgamma(0.9*1e25,1e25,log=T)the absolute error would be about 5360515 and if you exponentiated the result the relative error would be about 10 to the power 2328042. So the answer you wish to calculate is K times 10 to the power -2.32804237034994E+22, where K is somewhere between 10 to the power plus or minus 2328042. In other words when you make the changes to correct this problem, your calculation will still return values with no real meaning but at least users might be aware of this which would be no bad thing! For me this answer is possibly so meaningless that Nan would be preferable. I did mention to Morten that I had updated my code but I believed that for Gnumeric he was quite satisfied with what he had. If you look at the VBA code at http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I made to stop the overflow problems you seem to be worried about. My code for the pdf of the gamma distribution still fails for shape parameters > 2e307 due to multiplication of the shap parameter by 2pi. The code for dgamma will have the same problem unless it is hidden by use of an 80 or more bit floating point processor. The code for the asymptotic expansion for the gamma distribution seems to be fine for any number, excluding silly ones like Nan and Inf. Indeed it takes the difference from the mean as a parameter and if you supply an accurate value you get a sensible answer as mentioned in http://members.aol.com/iandjmsmith/Accuracy.htm I do not share your apparent sense of panic on this matter. I have no problems with error signals like NaNs because it is obvious to the user that things have gone wrong. Inaccurate answers when the user has no reason to expect them are usually far more difficult to spot and in many cases the results are just believed. That for me is a serious problem. I think you will find that the pgamma code of 2.0.0 did not work for small shape parameters (similar to the problems exhibited by pbeta still for small parameters see PR#2894), was inaccurate for large shape parameters (> 1e5) when it resorted to the normal approximation and was pretty slow in between. Indeed, the normal approximation was the cause of PR#7307. I don't understand your comments about "pt_ = -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) = -x * log((lambda-x)/x) - (lambda-x) and naively assumes that this is small enough to use a power series expansion in 1/x with coefficients as powers of pt_. To make matters worse, consider Ã¢Â€Â¦" In the example you go on to discuss, |(lambda-x)/x| is 0.1 and I don't think it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If |x| < .01 it is evaluated by a power series, if x < -0.8 or x > 1 it uses log1p(1+x)-x and for other values it uses a continued fraction which essentially evaluates more of the same series used when |x| < .01. Your comments about replacing logspace_add with logspace_sub with simpler code which works at first sight to be a very sensible improvement. However, I would be a bit nervous that lnd-lnp could be very large and the exp of it could return infinity. I'm sure this can be accounted for in the code and lnp + log1p(f*exp(lnd-lnp))evaluated as lnp or log(f)+lnd accordingly. I am not responsible for the code for calculating the logs of probabilities but I seem to remember that the extremely poor performance of the algorithms in R2.0.0 with logged probabilities was one of the reasons Morten became interested in changing the pgamma code (see PR#7307). I have had a quick look and once the corections mentioned above are made it should be giving nonsense answers with no difficulty. Unfortunately there are still a few examples of sloppy coding and accuracy errors remaining in R. The non-central distribution functions have horrible 1- cancellation errors associated with them (see PR#7099) and separate code is required for the two tails of the distributions to get round the problem. The fix for PR#8251 is a kludge and just moves the inaccuracies to examples with higher non-centrality parameters. pt(x,1) will overflow or return 0 for values < -2e154 for 64-bit implementations. pcauchy works but I believe the pt function is also supposed to work for non integral degrees of freedom so making it work one degree of freedom via pcauchy is hardly much use. qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow and by varying the parameters, qnbinom can be made very slow indeed. I do not think there is anything wrong with the Cornish-Fisher expansion. It just seems that it is not always very good for the Negative Binomial distribution. In the example above, the initial approximation is out by 2e6. A slightly different problem which can cause qnbinom and qbinom to go into infinite loops is when the q-value is too big. For example qnbinom(1E-300,0.000002,10000000000) should return 4.99813787561159E+15 approx but the code works with values where one of the statements y := y +1 or y = y - 1; is executed but does not alter the value of y. df(0,2,2,FALSE) should be 1 not 0 df(0,df,2,FALSE) should be infinity for df < 2 not 0 dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0 Presumably R also has similar problems with the pbeta function. As I recall the TOMS 708 code was pretty much included without edits and therefore didn't calculate logs of probabilities except by calculating the probability and then logging it. I assumed this was why it was not used for small shape parameters where the current code does not work, although it did not seem logical to me. Of course, my memory is not what it was but if that is the case and there are problems with modifying the TOMS code, you could try an asymptotic expansion based on http://members.aol.com/iandjmsmith/BinomialApprox.htm This response has been very rushed. I do not write well when I have plenty of time and I felt I had so many different things to say so I apologise if it is all a bit of a jumble. Ian Smith [[alternative HTML version deleted]]

From: Prof Brian Ripley <ripley@stats.ox.ac.uk> For the record, some of these claims are untrue: > df(0, 2, 2) [1] 1 > df(0, 1.3, 2) [1] Inf > x <- 1e-170 > pbeta(x, x, x) [1] 0.5 qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is an error, and so is qnbinom(1E-300,0.000002,10000000000) Here we can guess at what you meant, maybe correctly. There were comments in the source code about needing a better search, and I have recently implemented one. So > qnbinom(1e-10, 1e3, 1e-7) # instant [1] 8117986721 > qnbinom(0.5, 10000000000, 0.000000002) [1] 5e+18 > qnbinom(1e-300, 10000000000, 0.000002) [1] 4.998138e+15 seem to be solved. There were two problems with dbeta, one easily overcome (f underflows) the other fundamental to the way dbinom_raw is computed (n*p can underflow). What I cannot see is why a formula which worked correctly in this region was replaced by one that did not. It is precisely in order not to generate such errors that I used TOMS 708 only in the area where the existing algorithm is problematic. (It may be better elswehere, but I did not have the time to do the requisite analysis. It seems neither do some other people: I would prefer not to spend the time to clear up after such unneeded changes.) On pt, thank you for the report. pt(x, df=1) is not interesting for |x| > 1e150, but it is for smaller values of df and those were underflowing. It is easy to use an asymptotic formula to regain the accuracy. R was potentially generating reports of lack of convergence and loss of accuracy in quite a number of its algorithms, but for reasons unknown to me these were being buried (ML_ERROR did nothing, and has not for a very long time). It's a matter of debate whether in some of these it would be better to return NaN as well, but warnings should have been generated (and now are). As for `panic' (your word: why is it 'panic' to submit a correct bug report?), a major platform returning +Inf for a log probability is very bad news, as is another failing a regression test by getting NaN for a probability which is 0.5. On Sat, 28 Jan 2006 IandJMSmith@aol.com wrote: > On 23/02/05 I suggested that given R had included TOMS 708 to correct for t= > he=20 > poor performance of pbeta, TOMS 654 should be included to fix all the pgamm= > a=20 > problems. I was slightly surprised to find Morten's code had been included= > =20 > instead 2 days later. I noticed but did not worry that the reference to me = > had=20 > been removed.=20 > > The derivation of the asymptotic expansion for the gamma distribution used = > by=20 > Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm= > =20 > It is fairly easy to understand and find error bounds for and hence include= > =20 > sensibly in an algorithm to calculate pgamma. > > The basis and accuracy of the some of the algorithms I use is discussed in= > =20 > http://members.aol.com/iandjmsmith/Accuracy.htm In this case, the absolute = > error=20 > in the log of the probability gives a good indication of the accuracy of yo= > ur=20 > answer. In the least extreme example you consider=20 > (pgamma(0.9*1e25,1e25,log=3DT)the absolute error would be about 5360515 and= > if you exponentiated the result=20 > the relative error would be about 10 to the power 2328042. So the answer yo= > u=20 > wish to calculate is K times 10 to the power -2.32804237034994E+22, where K= > is=20 > somewhere between 10 to the power plus or minus 2328042. In other words whe= > n=20 > you make the changes to correct this problem, your calculation will still= > =20 > return values with no real meaning but at least users might be aware of thi= > s which=20 > would be no bad thing! For me this answer is possibly so meaningless that N= > an=20 > would be preferable. > > I did mention to Morten that I had updated my code but I believed that for= > =20 > Gnumeric he was quite satisfied with what he had. If you look at the VBA co= > de at=20 > http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I= > =20 > made to stop the overflow problems you seem to be worried about. My code fo= > r the=20 > pdf of the gamma distribution still fails for shape parameters > 2e307 due = > to=20 > multiplication of the shap parameter by 2pi. The code for dgamma will have = > the=20 > same problem unless it is hidden by use of an 80 or more bit floating point= > =20 > processor. The code for the asymptotic expansion for the gamma distribution= > =20 > seems to be fine for any number, excluding silly ones like Nan and Inf. Ind= > eed it=20 > takes the difference from the mean as a parameter and if you supply an=20 > accurate value you get a sensible answer as mentioned in=20 > http://members.aol.com/iandjmsmith/Accuracy.htm > > I do not share your apparent sense of panic on this matter. I have no=20 > problems with error signals like NaNs because it is obvious to the user tha= > t things=20 > have gone wrong. Inaccurate answers when the user has no reason to expect t= > hem=20 > are usually far more difficult to spot and in many cases the results are ju= > st=20 > believed. That for me is a serious problem. I think you will find that the= > =20 > pgamma code of 2.0.0 did not work for small shape parameters (similar to th= > e=20 > problems exhibited by pbeta still for small parameters see PR#2894), was=20 > inaccurate for large shape parameters (> 1e5) when it resorted to the norma= > l=20 > approximation and was pretty slow in between. Indeed, the normal approximat= > ion was the=20 > cause of PR#7307. > > > I don't understand your comments about=20 > "pt_ =3D -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) =3D -x * log((lambda-x= > )/x) -=20 > (lambda-x)=20 > and naively assumes that this is small enough to use a power series expansi= > on=20 > in 1/x with coefficients as powers of pt_. To make matters worse, consider = > =E2=80=A6" > In the example you go on to discuss, |(lambda-x)/x| is 0.1 and I don't thin= > k=20 > it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If = > |x|=20 > < .01 it is evaluated by a power series, if x < -0.8 or x > 1 it uses=20 > log1p(1+x)-x and for other values it uses a continued fraction which essent= > ially=20 > evaluates more of the same series used when |x| < .01. > > Your comments about replacing logspace_add with logspace_sub with simpler= > =20 > code which works at first sight to be a very sensible improvement. However,= > I=20 > would be a bit nervous that lnd-lnp could be very large and the exp of it c= > ould=20 > return infinity. I'm sure this can be accounted for in the code and lnp += > =20 > log1p(f*exp(lnd-lnp))evaluated as lnp or log(f)+lnd accordingly. > > I am not responsible for the code for calculating the logs of probabilities= > =20 > but I seem to remember that the extremely poor performance of the algorithm= > s in=20 > R2.0.0 with logged probabilities was one of the reasons Morten became=20 > interested in changing the pgamma code (see PR#7307). I have had a quick lo= > ok and=20 > once the corections mentioned above are made it should be giving nonsense a= > nswers=20 > with no difficulty. > > > Unfortunately there are still a few examples of sloppy coding and accuracy= > =20 > errors remaining in R. > > The non-central distribution functions have horrible 1- cancellation errors= > =20 > associated with them (see PR#7099) and separate code is required for the tw= > o=20 > tails of the distributions to get round the problem. > > The fix for PR#8251 is a kludge and just moves the inaccuracies to examples= > =20 > with higher non-centrality parameters. > > pt(x,1) will overflow or return 0 for values < -2e154 for 64-bit=20 > implementations. pcauchy works but I believe the pt function is also suppos= > ed to work for=20 > non integral degrees of freedom so making it work one degree of freedom via= > =20 > pcauchy is hardly much use. > > qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow and by varying the parameters,= > =20 > qnbinom can be made very slow indeed. I do not think there is anything wron= > g with=20 > the Cornish-Fisher expansion. It just seems that it is not always very good= > =20 > for the Negative Binomial distribution. In the example above, the initial= > =20 > approximation is out by 2e6. > > A slightly different problem which can cause qnbinom and qbinom to go into= > =20 > infinite loops is when the q-value is too big. For example=20 > qnbinom(1E-300,0.000002,10000000000) should return 4.99813787561159E+15 app= > rox but the code works=20 > with values where one of the statements y :=3D y +1 or y =3D y - 1; is exec= > uted but=20 > does not alter the value of y. > > df(0,2,2,FALSE) should be 1 not 0 > df(0,df,2,FALSE) should be infinity for df < 2 not 0 > dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0 > > Presumably R also has similar problems with the pbeta function. As I recall= > =20 > the TOMS 708 code was pretty much included without edits and therefore didn= > 't=20 > calculate logs of probabilities except by calculating the probability and t= > hen=20 > logging it. I assumed this was why it was not used for small shape paramete= > rs=20 > where the current code does not work, although it did not seem logical to m= > e.=20 > Of course, my memory is not what it was but if that is the case and there a= > re=20 > problems with modifying the TOMS code, you could try an asymptotic expansio= > n=20 > based on http://members.aol.com/iandjmsmith/BinomialApprox.htm > > This response has been very rushed. I do not write well when I have plenty = > of=20 > time and I felt I had so many different things to say so I apologise if it = > is=20 > all a bit of a jumble.=20 > > Ian Smith > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > > -- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595

From: Martin Maechler <maechler@stat.math.ethz.ch> >>>>> "BDR" == Prof Brian Ripley <ripley@stats.ox.ac.uk> >>>>> on Tue, 7 Feb 2006 08:52:43 +0000 (GMT) writes: BDR> For the record, some of these claims are untrue: >> df(0, 2, 2) BDR> [1] 1 >> df(0, 1.3, 2) BDR> [1] Inf Well, these first two I had fixed in the mean time. So Ian was right about reporting them. >> x <- 1e-170 >> pbeta(x, x, x) BDR> [1] 0.5 ( but Ian only explicitly mentioned dbeta(x,x,x) for x = 1e-162 and that *did* need to be fixed -- and I see you did fix it; thanks a lot! ) Martin BDR> ............................ BDR> ............................

*** Bug 14710 has been marked as a duplicate of this bug. ***