Bug 16489 - rhyper() get wrong results when m <<n.
Summary: rhyper() get wrong results when m <<n.
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Accuracy (show other bugs)
Version: R 3.2.1
Hardware: All All
: P3 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-07-27 19:57 UTC by Mia Zhang
Modified: 2015-07-29 11:05 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mia Zhang 2015-07-27 19:57:36 UTC
Dear all, 
It seems that rhyper() gives out wrong results randomly when m << n. 

rhyper(nn,m,n,k) generates hypergeometric distribution random numbers. according to theory, the mean of the random numbers generated should be around  m/(m+n)*k,

In my example, I keep m+n=1e9,k=1e6 and changing m. 

Here is what I get: 

> # this result is reasonable, the mean is around 1. 
> mean(rhyper(100,1000,1e9-1000,1e6))
[1] 0.91

> # ok
> mean(rhyper(100,2000,1e9-2000,1e6))
[1] 2.04

> # the mean should be around 5
> mean(rhyper(100,5000,1e9-5000,1e6))
[1] 0.64

> # the mean should be around 6
> mean(rhyper(100,6000,1e9-6000,1e6))
[1] 1.55

> # the mean should be around 8,but it gives out a 0!
> mean(rhyper(100,8000,1e9-8000,1e6))
[1] 0

> # the mean should be around 9
> mean(rhyper(100,9000,1e9-9000,1e6))
[1] 0.45

> #back to normal 
> mean(rhyper(100,10000,1e9-10000,1e6))
[1] 10.46

> #good
> mean(rhyper(100,20000,1e9-20000,1e6))
[1] 20.14

I get the first two right, but the ones in the middle are wrong, and strangely, it comes to normal again. 

Best regards
Mia
Comment 1 Peter Dalgaard 2015-07-27 21:06:22 UTC
Confirmed on OS X. Curiously, it doesn't seem to affect the inverse distribution function method:

 > mean(rhyper(100,8000,1e9-8000,1e6))
[1] 0
> mean(qhyper(runif(100),8000,1e9-8000,1e6))
[1] 8.48

...which may also offer a passable workaround.
Comment 2 Martin Maechler 2015-07-28 15:22:41 UTC
(In reply to Peter Dalgaard from comment #1)
> Confirmed on OS X. Curiously, it doesn't seem to affect the inverse
> distribution function method:
> 
>  > mean(rhyper(100,8000,1e9-8000,1e6))
> [1] 0
> > mean(qhyper(runif(100),8000,1e9-8000,1e6))
> [1] 8.48
> 
> ...which may also offer a passable workaround.

Yes....  Note that we are in a situation of large integer parameters (total = 1e9 *is* large), notably for the cases the algorithms were defined for,
*and*  we already had patched  qhyper() in the past for these.

((though it seems to me that the "large" branch there is also improvable).

BTW: It *is* branch II in the rhyper() algorithm, but there is no underflow 
     to 0 in  w = exp(...); w is actually pretty large (because of 'con')
Comment 3 Martin Maechler 2015-07-29 11:05:39 UTC
(In reply to Martin Maechler from comment #2)
> (In reply to Peter Dalgaard from comment #1)
> > Confirmed on OS X. Curiously, it doesn't seem to affect the inverse
> > distribution function method:
> > 
> >  > mean(rhyper(100,8000,1e9-8000,1e6))
> > [1] 0
> > > mean(qhyper(runif(100),8000,1e9-8000,1e6))
> > [1] 8.48
> > 
> > ...which may also offer a passable workaround.
> 
> Yes....  Note that we are in a situation of large integer parameters (total
> = 1e9 *is* large), notably for the cases the algorithms were defined for,
> *and*  we already had patched  qhyper() in the past for these.
> 
> ((though it seems to me that the "large" branch there is also improvable).
> 
> BTW: It *is* branch II in the rhyper() algorithm, but there is no underflow 
>      to 0 in  w = exp(...); w is actually pretty large (because of 'con')


In the end, it was just an integer multiplication overflow..

Committed to R-devel and will be ported to "R 3.2.2 patched".