Bug 8528 - pgamma - inadequate algorithm design and poor coding
Status: NEW
Product: R
Classification: Unclassified
Component: Accuracy
Version: old
Hardware: All
OS: All
Priority: P5
Severity: normal
Assigned To: Jitterbug compatibility account
Duplicates: 14710
Depends on:
Blocks:
Reported: 2006-01-27 19:31 UTC by Jitterbug compatibility account
Modified: 2011-10-22 07:46 UTC

Description Jitterbug compatibility account 2006-01-27 19:31:35 UTC
From: Prof Brian Ripley <ripley@stats.ox.ac.uk>
R versions 2.1.0 to present.

Examples shown were computed under Windows R-devel, current SVN, but ix86 
Linux shows similar behaviour (sometimes NaN or -Inf rather than Inf, 
depending on the compiler and optimization level used).


The replacement pgamma algorithm used from R 2.1.0 has an inadequate 
design and no supporting documentation whatsoever.  There is no reference 
given to support the algorithm, and it seems very desirable to use only 
algorithms with a published (and preferably refereed) analysis, or at 
least of impeccable provenance.

The following errors were found by investigating an example in the 
d-p-q-r-tests.R regression tests that gave NaN on a real system.

These errors were not present in R 2.0.0, which used a normal 
approximation in that region.  We could fix this by reverting where needed 
to a normal approximation, but that leaves the problem that we have no 
proof of the validity or accuracy of the rest of the algorithm (if indeed 
it is accurate).

?pgamma says

      As from R 2.1.0 'pgamma()' uses a new algorithm (mainly by Morten
      Welinder) which should be uniformly as accurate as AS 239.

Well, it 'should be' but it is not, and we should not be making statements 
like that.  The claims in the email quoted in pgamma.c exhibit hubris.

There are also at least two examples of sloppy coding that lead to numeric 
overflow and complete loss of accuracy.


Consider

> pgamma(seq(0.75, 1.25, by=0.05)*1e100, shape = 1e100, log=T)
  [1] -3.768207e+98           NaN           NaN           NaN           NaN
  [6] -6.931472e-01           NaN           NaN           NaN           NaN
[11]  0.000000e+00
Warning message:
NaNs produced in: pgamma(q, shape, scale, lower.tail, log.p)
> pgamma(seq(0.75, 1.25, by=0.05)*1e100, shape = 1e100, log=T, lower=F)
  [1]  0.000000e+00           NaN           NaN           NaN           NaN
  [6] -6.931472e-01           Inf           Inf           Inf           Inf
[11] -2.685645e+98
Warning message:
NaNs produced in: pgamma(q, shape, scale, lower.tail, log.p)

> pgamma(c(1-1e-10, 1+1e-10)*1e100, shape = 1e100)
[1] NaN NaN

(shape=1e25 is enough to cause a breakdown in the first of these, and 
1e60 in the rest.)

The code has four branches

1) x <= 1

2) (x <= alph - 1 && x < 0.8 * (alph + 50)).  This has the comment
/* incl. large alph */, but that is false.

3) if (alph - 1 < x && alph < 0.8 * (x + 50)).  This has the comment
/* incl. large x */, but again false.

4) The rest, which uses an asymptotic expansion in

pt_ = -x * (log(1 + (lambda-x)/x) - (lambda-x)/x)
    = -x * log(lambda/x) + (lambda-x)

and naively assumes that this is small enough to use a power series 
expansion in 1/x with coefficients as powers of pt_.  To make matters 
worse, consider

> pgamma(0.9*1e25, 1e25, log=T)
pgamma_raw(x=9e+024, alph=1e+025, low=1, log=1)
  using ppois_asymp()
pt_ = 5.3605156578263e+022
pp*_asymp(): f=-2.0803930924076e-013 np=-5.3605156578263e+022 
nd=-5.3605156578263e+022  f*nd=11151979746.284
[1] -Inf

Hmm, how did that manage to lose *all* the accuracy here?  Hubris again 
appears in the comments.

Here np and nd are computed on the log scale, say lnp = log(np) and 
lnd = log(nd); when the probabilities are tiny these logs are large in 
magnitude, negative, and almost equal, while f is not large (and if it were 
we could have computed log f).  So we can compute the log of np + f*nd 
accurately as

log(np*(1 + f*nd/np)) = lnp + log1p(f*nd/np) = lnp + log1p(f*exp(lnd - lnp))
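
A minimal R sketch of this combination, using the values from the trace 
above (logspace_combine is an illustrative name, not a function in pgamma.c):

logspace_combine <- function(lnp, lnd, f) {
  ## log(np + f*nd) computed without ever leaving the log scale
  lnp + log1p(f * exp(lnd - lnp))
}
logspace_combine(-5.3605156578263e22, -5.3605156578263e22, -2.0803930924076e-13)
## [1] -5.360516e+22, finite instead of the -Inf shown above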


Almost all the mass of gamma(shape=1e100) is concentrated at the nearest 
representable value to 1e100:

> qgamma(c(1e-16, 1-1e-16), 1e100)-1e100
[1] 0 0

(if it can be believed, but this can be verified independently).  So being 
accurate in the middle of the range is pretty academic, but one can at 
least avoid returning the nonsense of non-monotone cdfs and NaN/infinite 
probabilities.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Comment 1 Jitterbug compatibility account 2006-01-27 21:37:00 UTC
NOTES:
 A partial fix is in 2.2.1 patched.

Needs documentation, e.g. for the accuracy claims.
Comment 2 Jitterbug compatibility account 2006-01-27 22:37:18 UTC
Audit (from Jitterbug):
Fri Jan 27 15:31:14 2006	ripley	changed notes
Fri Jan 27 14:31:14 2006	ripley	moved from incoming to Accuracy
Fri Jan 27 16:37:18 2006	ripley	changed notes
Comment 3 Jitterbug compatibility account 2006-01-28 14:43:33 UTC
From: IandJMSmith@aol.com
On 23/02/05 I suggested that given R had included TOMS 708 to correct for the 
poor performance of pbeta, TOMS 654 should be included to fix all the pgamma 
problems. I was slightly surprised to find Morten's code had been included 
instead 2 days later. I noticed but did not worry that the reference to me had 
been removed. 

The derivation of the asymptotic expansion for the gamma distribution used by 
Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm 
It is fairly easy to understand and to find error bounds for, and hence to 
include sensibly in an algorithm to calculate pgamma.

The basis and accuracy of some of the algorithms I use is discussed in 
http://members.aol.com/iandjmsmith/Accuracy.htm  In this case, the absolute error 
in the log of the probability gives a good indication of the accuracy of your 
answer. In the least extreme example you consider, 
pgamma(0.9*1e25, 1e25, log=T), the absolute error would be about 5360515, and if 
you exponentiated the result the relative error would be about 10 to the power 
2328042. So the answer you wish to calculate is K times 10 to the power 
-2.32804237034994E+22, where K is somewhere between 10 to the power plus or 
minus 2328042. In other words, when you make the changes to correct this 
problem, your calculation will still return values with no real meaning, but at 
least users might be aware of this, which would be no bad thing! For me this 
answer is possibly so meaningless that NaN would be preferable.

I did mention to Morten that I had updated my code but I believed that for 
Gnumeric he was quite satisfied with what he had. If you look at the VBA code at 
http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I 
made to stop the overflow problems you seem to be worried about. My code for the 
pdf of the gamma distribution still fails for shape parameters > 2e307 due to 
multiplication of the shape parameter by 2*pi. The code for dgamma will have the 
same problem unless it is hidden by use of an 80-bit or wider floating-point 
processor. The code for the asymptotic expansion for the gamma distribution 
seems to be fine for any number, excluding silly ones like NaN and Inf. Indeed, 
it takes the difference from the mean as a parameter, and if you supply an 
accurate value you get a sensible answer, as mentioned in 
http://members.aol.com/iandjmsmith/Accuracy.htm

I do not share your apparent sense of panic on this matter. I have no 
problems with error signals like NaNs, because it is obvious to the user that 
things have gone wrong. Inaccurate answers when the user has no reason to 
expect them are usually far more difficult to spot, and in many cases the 
results are just believed. That, for me, is a serious problem. I think you will 
find that the pgamma code of 2.0.0 did not work for small shape parameters 
(similar to the problems pbeta still exhibits for small parameters; see 
PR#2894), was inaccurate for large shape parameters (> 1e5) when it resorted 
to the normal approximation, and was pretty slow in between. Indeed, the 
normal approximation was the cause of PR#7307.


I don't understand your comments about 
"pt_ = -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) = -x * log(lambda/x) + 
(lambda-x) 
and naively assumes that this is small enough to use a power series expansion 
in 1/x with coefficients as powers of pt_. To make matters worse, consider …"
In the example you go on to discuss, |(lambda-x)/x| is 0.1, and I don't think 
it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If |x| 
< .01 it is evaluated by a power series; if x < -0.8 or x > 1 it uses 
log1p(x)-x; and for other values it uses a continued fraction which essentially 
evaluates more of the same series used when |x| < .01.
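
A rough R transcription of that branching (the cutoffs are the ones quoted 
above; the middle continued-fraction branch is only stubbed with log1p here, 
so this is a sketch of the scheme, not the pgamma.c code):

log1pmx <- function(x) {            # log(1+x) - x
  if (x < -0.8 || x > 1) {
    log1p(x) - x                    # direct evaluation is safe out here
  } else if (abs(x) < 0.01) {
    ## leading terms of log(1+x) - x = -x^2/2 + x^3/3 - x^4/4 + ...
    -x * x * (1/2 - x * (1/3 - x * (1/4 - x/5)))
  } else {
    log1p(x) - x                    # the real code uses a continued fraction
  }
}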

Your suggestion of replacing the logspace_add/logspace_sub calls with simpler 
code looks at first sight to be a very sensible improvement. However, I 
would be a bit nervous that lnd-lnp could be very large and the exp of it could 
return infinity. I'm sure this can be accounted for in the code, with lnp + 
log1p(f*exp(lnd-lnp)) evaluated as lnp or log(f)+lnd accordingly.
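
A hedged sketch of that guard in R, again with illustrative names (and 
assuming f > 0 in the overflow branch):

logspace_combine_safe <- function(lnp, lnd, f) {
  d <- lnd - lnp
  if (d > 700) {                 # exp(d) would overflow double precision
    log(f) + lnd                 # the f*nd term dominates here
  } else {
    lnp + log1p(f * exp(d))      # exp(d) underflowing to 0 just gives lnp
  }
}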

I am not responsible for the code for calculating the logs of probabilities, 
but I seem to remember that the extremely poor performance of the algorithms in 
R 2.0.0 with logged probabilities was one of the reasons Morten became 
interested in changing the pgamma code (see PR#7307). I have had a quick look, 
and once the corrections mentioned above are made it should be giving nonsense 
answers with no difficulty.


Unfortunately there are still a few examples of sloppy coding and accuracy 
errors remaining in R.

The non-central distribution functions have horrible cancellation errors of 
the 1 - p kind associated with them (see PR#7099), and separate code is 
required for the two tails of the distributions to get round the problem.
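
The effect is easy to demonstrate in R with any upper tail computed as 1 - p 
rather than directly:

> 1 - pnorm(10)                    # complete cancellation
[1] 0
> pnorm(10, lower.tail = FALSE)    # separate upper-tail code stays accurate
[1] 7.619853e-24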

The fix for PR#8251 is a kludge and just moves the inaccuracies to examples 
with higher non-centrality parameters.

pt(x,1) will overflow or return 0 for values < -2e154 on 64-bit 
implementations. pcauchy works, but I believe the pt function is also supposed 
to work for non-integral degrees of freedom, so making it work for one degree 
of freedom via pcauchy is hardly much use.

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow, and by varying the parameters 
qnbinom can be made very slow indeed. I do not think there is anything wrong with 
the Cornish-Fisher expansion. It just seems that it is not always very good 
for the Negative Binomial distribution. In the example above, the initial 
approximation is out by 2e6.

A slightly different problem, which can cause qnbinom and qbinom to go into 
infinite loops, is when the q-value is too big. For example, 
qnbinom(1E-300,0.000002,10000000000) should return approximately 
4.99813787561159E+15, but the search reaches values of y so large that the 
statements y := y + 1 and y := y - 1 are executed without altering the value 
of y.
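
This is the usual 2^53 limit: once y exceeds 2^53 (about 9.0e15), consecutive 
integers are no longer representable as doubles, so a unit step can be lost 
and the loop never terminates:

> y <- 2^53
> (y + 1) == y
[1] TRUE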

df(0,2,2,FALSE) should be 1 not 0
df(0,df,2,FALSE) should be infinity for df < 2 not 0
dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0

Presumably R also has similar problems with the pbeta function. As I recall 
the TOMS 708 code was pretty much included without edits and therefore didn't 
calculate logs of probabilities except by calculating the probability and then 
logging it. I assumed this was why it was not used for small shape parameters 
where the current code does not work, although it did not seem logical to me. 
Of course, my memory is not what it was but if that is the case and there are 
problems with modifying the TOMS code, you could try an asymptotic expansion 
based on http://members.aol.com/iandjmsmith/BinomialApprox.htm

This response has been very rushed. I do not write well even when I have 
plenty of time, and I had so many different things to say, so I apologise if 
it is all a bit of a jumble. 

Ian Smith


	[[alternative HTML version deleted]]

Comment 4 Jitterbug compatibility account 2006-02-07 14:52:43 UTC
From: Prof Brian Ripley <ripley@stats.ox.ac.uk>
For the record, some of these claims are untrue:

> df(0, 2, 2)
[1] 1
> df(0, 1.3, 2)
[1] Inf
> x <- 1e-170
> pbeta(x, x, x)
[1] 0.5

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is an error, and so is 
qnbinom(1E-300,0.000002,10000000000)

Here we can guess at what you meant, maybe correctly.  There were comments 
in the source code about needing a better search, and I have recently 
implemented one.  So

> qnbinom(1e-10, 1e3, 1e-7) # instant
[1] 8117986721
> qnbinom(0.5, 10000000000, 0.000000002)
[1] 5e+18
> qnbinom(1e-300, 10000000000, 0.000002)
[1] 4.998138e+15

seem to be solved.
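
For reference, one standard shape for such a search is geometric expansion of 
a bracket from the initial guess followed by bisection; a hedged R sketch 
(an illustration of the idea, not the code now in qnbinom.c):

qnbinom_search <- function(p, size, prob, guess) {
  ## smallest y with pnbinom(y, size, prob) >= p
  lo <- 0; hi <- max(1, round(guess))
  while (pnbinom(hi, size, prob) < p) { lo <- hi; hi <- 2 * hi }  # expand
  while (hi - lo > 1) {                                           # bisect
    mid <- floor((lo + hi) / 2)
    if (pnbinom(mid, size, prob) < p) lo <- mid else hi <- mid
  }
  if (pnbinom(lo, size, prob) >= p) lo else hi
}

An initial guess that is out by 2e6 then costs only a few dozen pnbinom() 
calls rather than millions of unit steps.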

There were two problems with dbeta: one easily overcome (f underflows), the 
other fundamental to the way dbinom_raw is computed (n*p can underflow). 
What I cannot see is why a formula which worked correctly in this region 
was replaced by one that did not.  It is precisely in order not to 
generate such errors that I used TOMS 708 only in the area where the 
existing algorithm is problematic.  (It may be better elsewhere, but I did 
not have the time to do the requisite analysis.  It seems neither did some 
other people: I would prefer not to spend the time to clear up after such 
unneeded changes.)

On pt, thank you for the report.  pt(x, df=1) is not interesting for
|x| > 1e150, but it is for smaller values of df and those were 
underflowing.  It is easy to use an asymptotic formula to regain the 
accuracy.
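
For the record, a hedged sketch of one such far-tail formula on the log 
scale, derived from the t density's c*|x|^-(df+1) tail behaviour (my own 
illustration, not necessarily the expression used in R's pt.c):

pt_far_tail_log <- function(x, df) {
  ## log P(T <= x) for x -> -Inf: integrating c*|t|^-(df+1) gives
  ## (c/df)*|x|^-df, c = df^(df/2)*gamma((df+1)/2)/(sqrt(pi)*gamma(df/2))
  stopifnot(x < 0)
  lc <- (df/2) * log(df) + lgamma((df + 1)/2) - 0.5 * log(pi) - lgamma(df/2)
  lc - log(df) - df * log(-x)
}
## df = 1 recovers the Cauchy tail: pt_far_tail_log(-2e154, 1) equals
## log(1/(pi * 2e154)), where plain pt(-2e154, 1) underflowed.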

R was potentially generating reports of lack of convergence and loss of 
accuracy in quite a number of its algorithms, but for reasons unknown to 
me these were being buried (ML_ERROR did nothing, and had not for a very 
long time). It's a matter of debate whether in some of these cases it would 
be better to return NaN as well, but warnings should have been generated (and 
now are).

As for `panic' (your word: why is it 'panic' to submit a correct bug 
report?), a major platform returning +Inf for a log probability is very 
bad news, as is another failing a regression test by getting NaN for a 
probability which is 0.5.

On Sat, 28 Jan 2006 IandJMSmith@aol.com wrote:

> On 23/02/05 I suggested that given R had included TOMS 708 to correct for t=
> he=20
> poor performance of pbeta, TOMS 654 should be included to fix all the pgamm=
> a=20
> problems. I was slightly surprised to find Morten's code had been included=
> =20
> instead 2 days later. I noticed but did not worry that the reference to me =
> had=20
> been removed.=20
>
> The derivation of the asymptotic expansion for the gamma distribution used =
> by=20
> Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm=
> =20
> It is fairly easy to understand and find error bounds for and hence include=
> =20
> sensibly in an algorithm to calculate pgamma.
>
> The basis and accuracy of the some of the algorithms I use is discussed in=
> =20
> http://members.aol.com/iandjmsmith/Accuracy.htm In this case, the absolute =
> error=20
> in the log of the probability gives a good indication of the accuracy of yo=
> ur=20
> answer. In the least extreme example you consider=20
> (pgamma(0.9*1e25,1e25,log=3DT)the absolute error would be about 5360515 and=
> if you exponentiated the result=20
> the relative error would be about 10 to the power 2328042. So the answer yo=
> u=20
> wish to calculate is K times 10 to the power -2.32804237034994E+22, where K=
> is=20
> somewhere between 10 to the power plus or minus 2328042. In other words whe=
> n=20
> you make the changes to correct this problem, your calculation will still=
> =20
> return values with no real meaning but at least users might be aware of thi=
> s which=20
> would be no bad thing! For me this answer is possibly so meaningless that N=
> an=20
> would be preferable.
>
> I did mention to Morten that I had updated my code but I believed that for=
> =20
> Gnumeric he was quite satisfied with what he had. If you look at the VBA co=
> de at=20
> http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I=
> =20
> made to stop the overflow problems you seem to be worried about. My code fo=
> r the=20
> pdf of the gamma distribution still fails for shape parameters > 2e307 due =
> to=20
> multiplication of the shap parameter by 2pi. The code for dgamma will have =
> the=20
> same problem unless it is hidden by use of an 80 or more bit floating point=
> =20
> processor. The code for the asymptotic expansion for the gamma distribution=
> =20
> seems to be fine for any number, excluding silly ones like Nan and Inf. Ind=
> eed it=20
> takes the difference from the mean as a parameter and if you supply an=20
> accurate value you get a sensible answer as mentioned in=20
> http://members.aol.com/iandjmsmith/Accuracy.htm
>
> I do not share your apparent sense of panic on this matter. I have no=20
> problems with error signals like NaNs because it is obvious to the user tha=
> t things=20
> have gone wrong. Inaccurate answers when the user has no reason to expect t=
> hem=20
> are usually far more difficult to spot and in many cases the results are ju=
> st=20
> believed. That for me is a serious problem. I think you will find that the=
> =20
> pgamma code of 2.0.0 did not work for small shape parameters (similar to th=
> e=20
> problems exhibited by pbeta still for small parameters see PR#2894), was=20
> inaccurate for large shape parameters (> 1e5) when it resorted to the norma=
> l=20
> approximation and was pretty slow in between. Indeed, the normal approximat=
> ion was the=20
> cause of PR#7307.
>
>
> I don't understand your comments about=20
> "pt_ =3D -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) =3D -x * log((lambda-x=
> )/x) -=20
> (lambda-x)=20
> and naively assumes that this is small enough to use a power series expansi=
> on=20
> in 1/x with coefficients as powers of pt_. To make matters worse, consider =
> =E2=80=A6"
> In the example you go on to discuss, |(lambda-x)/x| is 0.1 and I don't thin=
> k=20
> it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If =
> |x|=20
> < .01 it is evaluated by a power series, if x < -0.8 or x > 1 it uses=20
> log1p(1+x)-x and for other values it uses a continued fraction which essent=
> ially=20
> evaluates more of the same series used when |x| < .01.
>
> Your comments about replacing logspace_add with logspace_sub with simpler=
> =20
> code which works at first sight to be a very sensible improvement. However,=
> I=20
> would be a bit nervous that lnd-lnp could be very large and the exp of it c=
> ould=20
> return infinity. I'm sure this can be accounted for in the code and lnp +=
> =20
> log1p(f*exp(lnd-lnp))evaluated as lnp or log(f)+lnd accordingly.
>
> I am not responsible for the code for calculating the logs of probabilities=
> =20
> but I seem to remember that the extremely poor performance of the algorithm=
> s in=20
> R2.0.0 with logged probabilities was one of the reasons Morten became=20
> interested in changing the pgamma code (see PR#7307). I have had a quick lo=
> ok and=20
> once the corections mentioned above are made it should be giving nonsense a=
> nswers=20
> with no difficulty.
>
>
> Unfortunately there are still a few examples of sloppy coding and accuracy=
> =20
> errors remaining in R.
>
> The non-central distribution functions have horrible 1- cancellation errors=
> =20
> associated with them (see PR#7099) and separate code is required for the tw=
> o=20
> tails of the distributions to get round the problem.
>
> The fix for PR#8251 is a kludge and just moves the inaccuracies to examples=
> =20
> with higher non-centrality parameters.
>
> pt(x,1) will overflow or return 0 for values < -2e154 for 64-bit=20
> implementations. pcauchy works but I believe the pt function is also suppos=
> ed to work for=20
> non integral degrees of freedom so making it work one degree of freedom via=
> =20
> pcauchy is hardly much use.
>
> qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow and by varying the 
parameters,=
> =20
> qnbinom can be made very slow indeed. I do not think there is anything wron=
> g with=20
> the Cornish-Fisher expansion. It just seems that it is not always very good=
> =20
> for the Negative Binomial distribution. In the example above, the initial=
> =20
> approximation is out by 2e6.
>
> A slightly different problem which can cause qnbinom and qbinom to go into=
> =20
> infinite loops is when the q-value is too big. For example=20
> qnbinom(1E-300,0.000002,10000000000) should return 4.99813787561159E+15 app=
> rox but the code works=20
> with values where one of the statements y :=3D y +1 or y =3D y - 1; is exec=
> uted but=20
> does not alter the value of y.
>
> df(0,2,2,FALSE) should be 1 not 0
> df(0,df,2,FALSE) should be infinity for df < 2 not 0
> dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0
>
> Presumably R also has similar problems with the pbeta function. As I recall=
> =20
> the TOMS 708 code was pretty much included without edits and therefore didn=
> 't=20
> calculate logs of probabilities except by calculating the probability and t=
> hen=20
> logging it. I assumed this was why it was not used for small shape paramete=
> rs=20
> where the current code does not work, although it did not seem logical to m=
> e.=20
> Of course, my memory is not what it was but if that is the case and there a=
> re=20
> problems with modifying the TOMS code, you could try an 
asymptotic expansio=
> n=20
> based on http://members.aol.com/iandjmsmith/BinomialApprox.htm
>
> This response has been very rushed. I do not write well when I have plenty =
> of=20
> time and I felt I had so many different things to say so I apologise if it =
> is=20
> all a bit of a jumble.=20
>
> Ian Smith
>
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Comment 5 Jitterbug compatibility account 2006-02-07 16:04:01 UTC
From: Martin Maechler <maechler@stat.math.ethz.ch>
>>>>> "BDR" == Prof Brian Ripley <ripley@stats.ox.ac.uk>
>>>>>     on Tue, 7 Feb 2006 08:52:43 +0000 (GMT) writes:

    BDR> For the record, some of these claims are untrue:
    >> df(0, 2, 2)
    BDR> [1] 1
    >> df(0, 1.3, 2)
    BDR> [1] Inf

Well, these first two I had fixed in the meantime.
So Ian was right about reporting them.

    >> x <- 1e-170
    >> pbeta(x, x, x)
    BDR> [1] 0.5

( but Ian only explicitly mentioned dbeta(x,x,x)  for x = 1e-162 
  and that *did* need to be fixed -- and I see you did fix it; 
  thanks a lot! )

Martin


Comment 6 Brian Ripley 2011-10-22 07:46:59 UTC
*** Bug 14710 has been marked as a duplicate of this bug. ***