Bug 17140 - For long vector 'x', tabulate(x, n) may be silently wrong
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Misc
Version: R 3.3.*
Hardware: Other Linux-Debian
Importance: P5 normal
Assignee: R-core
Reported: 2016-09-03 23:33 UTC by Suharto Anggono
Modified: 2016-09-15 08:01 UTC
Description Suharto Anggono 2016-09-03 23:33:36 UTC
R help on 'tabulate', in "Value" section, states the following.
On 64-bit platforms ‘bin’ can have 2^31 or more elements and hence a count could exceed the maximum integer: this is currently an error.

In reality, when a count exceeds the maximum integer, 'tabulate' silently gives a wrong answer instead of signalling an error.

In RStudio in Data Scientist Workbench:

R> tabulate(rep_len(1L, 2^31), 1)
[1] NA
R> tabulate(rep_len(1L, 2^31+1), 1)
[1] -2147483647
R> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] SparkR_1.6.0

loaded via a namespace (and not attached):
[1] tools_3.3.1
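
The two results above are consistent with the true count being stored in a signed 32-bit integer that wraps around. The following is only a sketch of that arithmetic, not the actual 'tabulate' internals, which the report does not show; `wrap_int32` is a hypothetical helper introduced for illustration. It relies on the fact that R uses the smallest 32-bit value, -2^31, to represent NA_integer_.

```r
# Hypothetical helper (not part of R): simulate storing a true count in a
# signed 32-bit integer with two's-complement wraparound. The arithmetic is
# done in doubles, so the simulation itself cannot overflow.
wrap_int32 <- function(count) {
  wrapped <- (count + 2^31) %% 2^32 - 2^31
  # R reserves the bit pattern of -2^31 for NA_integer_.
  if (wrapped == -2^31) NA_integer_ else as.integer(wrapped)
}

wrap_int32(2^31)      # NA, as in the first call above
wrap_int32(2^31 + 1)  # -2147483647, as in the second call above
```

Under this assumption, a true count of 2^31 lands exactly on the NA bit pattern, and 2^31 + 1 lands one above it, which matches both reported outputs.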
Comment 1 Martin Maechler 2016-09-14 19:50:31 UTC
I think we should do the following for  tabulate(bin, nbin)

if(length(bin) > .Machine$integer.max) {
  ## it *can* be that the resulting count is = length(bin) > "maxInt",
  ## and so we should return a double() instead of an integer() vector of counts
}

## yes, it would be much nicer if we had 64bit-integers in R
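
One way to picture the proposal at user level is the sketch below. It is an illustration under stated assumptions, not the change made in R itself: `tabulate_safe`, `int_max` and `chunk` are hypothetical names, and counting in chunks is just one workaround that keeps each partial count within integer range before accumulating in doubles.

```r
# Hypothetical wrapper (not part of R): return double counts when 'bin' is
# long enough that a single count could exceed .Machine$integer.max.
# 'int_max' and 'chunk' are parameters only so that the double branch can be
# exercised on small inputs; 'chunk' must not exceed .Machine$integer.max.
tabulate_safe <- function(bin, nbins = max(1L, bin, na.rm = TRUE),
                          int_max = .Machine$integer.max, chunk = 1e6) {
  n <- length(bin)
  # Short input: no count can overflow, so keep the usual integer result.
  if (n <= int_max) return(tabulate(bin, nbins))

  counts <- numeric(nbins)  # double accumulator
  start <- 1
  while (start <= n) {
    end <- min(start + chunk - 1, n)
    # Each chunk has at most 'chunk' elements, so its counts fit in integers.
    counts <- counts + tabulate(bin[start:end], nbins)
    start <- end + 1
  }
  counts
}

x <- c(1L, 2L, 2L, 3L, 2L)
tabulate_safe(x, 3)                          # integer counts, as from tabulate()
tabulate_safe(x, 3, int_max = 2, chunk = 2)  # same counts, returned as double
```

The result type then depends only on length(bin), so callers that need integers for short inputs still get them.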
Comment 2 Martin Maechler 2016-09-15 08:01:52 UTC
(In reply to Martin Maechler from comment #1)
> I think we should do the following for  tabulate(bin, nbin)
> 
> if(length(bin) > .Machine$integer.max) {
>   ## it *can* be that the resulting count is = length(bin) > "maxInt",
>   ## and so we should return a double() instead of an integer() vector of counts
> }
> 
> ## yes, it would be much nicer if we had 64bit-integers in R

Now fixed in R-devel with svn rev. 71255. The only caveat: some results that involve no overflow and used to have type "integer" now have type "double".