Bug 17139 - For long vector 'x', split(x, f) may be wrong
Summary: For long vector 'x', split(x, f) may be wrong
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Misc (show other bugs)
Version: R 3.3.*
Hardware: Other Linux
: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-09-03 10:08 UTC by Suharto Anggono
Modified: 2016-10-05 19:40 UTC (History)
1 user (show)

See Also:


Attachments
A fix to split.c (2.83 KB, patch)
2016-09-03 10:21 UTC, Suharto Anggono
Details | Diff
A fix to 'do_split' in split.c (2.83 KB, patch)
2016-09-03 17:34 UTC, Suharto Anggono
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Suharto Anggono 2016-09-03 10:08:49 UTC
In this example, 'split' fails. This is in RStudio in Data Scientist Workbench.

> s <- split(raw(2^31), factor(1))
Error in split.default(raw(2^31), factor(1)) : 
  negative length vectors are not allowed
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SparkR_1.6.0

loaded via a namespace (and not attached):
[1] tools_3.3.1
Comment 1 Suharto Anggono 2016-09-03 10:21:31 UTC
Created attachment 2145 [details]
A fix to split.c

In split.c, I guess that
INTEGER(counts)[j - 1]++;
may lead to integer overflow.

This copies code and uses double vector for 'counts' for length larger than maximum integer.
Comment 2 Suharto Anggono 2016-09-03 17:34:58 UTC
Created attachment 2146 [details]
A fix to 'do_split' in split.c
Comment 3 Martin Maechler 2016-09-30 20:46:45 UTC
Thank you.
Bug confirmed,  and that the fix works, too.

However, I'll prefer to #include the "body" twice  instead of  copy-paste-modify programming.  My plan is to commit a fix tomorrow for R-devel  and also port to R-patched.
Comment 4 Suharto Anggono 2016-10-03 15:31:52 UTC
In file split-incl.c in R devel r71442, in place of _L_int_ , R_xlen_t could always be used.
Comment 5 Suharto Anggono 2016-10-05 15:30:22 UTC
The NEWS entry is the following.
      \item \code{split(<very_long>, *)} now works with long vectors.
      (\PR{17193})

Actually, in 64-bit R 3.3.1, for long vector 'x', split(x, f) works if no cell counts exceed maximum integer. The case is like 'tabulate'.

Example that works:
s <- split(raw(2^31), factor(1:2))

Also, \PR{17193} should be \PR{17139} .
Comment 6 Martin Maechler 2016-10-05 19:40:47 UTC
(In reply to Suharto Anggono from comment #5)
> The NEWS entry is the following.
>       \item \code{split(<very_long>, *)} now works with long vectors.
>       (\PR{17193})
> 
> Actually, in 64-bit R 3.3.1, for long vector 'x', split(x, f) works if no
> cell counts exceed maximum integer. The case is like 'tabulate'.
> 
> Example that works:
> s <- split(raw(2^31), factor(1:2))

Yes, of course.  I've clarified that now, even though the previous entry did not say that split(<very_long>, *) would always fail.   It rather it now always works..
> 
> Also, \PR{17193} should be \PR{17139} .

Definitely, thank you!