Bug 17292 - c(a=raw(2), raw(2^31-1)) fails
Summary: c(a=raw(2), raw(2^31-1)) fails
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Low-level (show other bugs)
Version: R 3.3.*
Hardware: Other Linux-Debian
: P5 minor
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2017-06-18 12:14 UTC by Suharto Anggono
Modified: 2017-07-25 15:29 UTC (History)
1 user (show)

See Also:


Attachments
Against R devel 72947, bind.c (265 bytes, patch)
2017-07-22 00:18 UTC, Suharto Anggono
Details | Diff
Against R devel r72947, bind.c, for nameData->seqno overflow (1.02 KB, patch)
2017-07-22 01:30 UTC, Suharto Anggono
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Suharto Anggono 2017-06-18 12:14:11 UTC
Failed example:
res <- c(a=raw(2), raw(2^31-1))
Result:
Error: attempt to set index 18446744071562067968/2147483649 in SET_STRING_ELT

Successful example (producing correct 'res'):
res <- c(raw(2), raw(2^31-1))
res <- c(a=raw(2), raw(2^31-2))

Output of sessionInfo() :
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics 
[3] grDevices utils    
[5] datasets  methods  
[7] base     

other attached packages:
[1] SparkR_1.6.1

loaded via a namespace (and not attached):
[1] tools_3.3.3

It was in RStudio, https://my.datascientistworkbench.com/tools/rstudio-ide/ .

I suspect that the culprit is this part in function 'NewExtractNames' in bind.c.
SET_STRING_ELT(data->ans_names, (data->ans_nnames)++, namei);

In function 'NewExtractNames', 'data' is a pointer to struct BindData. In struct BindData, field 'ans_nnames' has type int. When data->ans_nnames is (2^31-1) (maximum int),
(data->ans_nnames)++
leads to overflow. The overflowed value is passed to 'SET_STRING_ELT' in the next call.

It is still like that in R devel r72807.

The error message in the failed example above is also not quite right. The index that is passed to 'SET_STRING_ELT' is not 18446744071562067968.
Comment 1 Martin Maechler 2017-06-20 08:58:00 UTC
Don't do this on your laptop/desktop  unless you have 20 GB free memory !!

Confirmed indeed on latest R-devel (64 bit; Fedora 24 Linux)

> system.time(res <- c(a = raw(2), raw(2^31-1)))
Error in system.time(res <- c(a = raw(2), raw(2^31 - 1))) : 
  attempt to set index 18446744071562067968/2147483649 in SET_STRING_ELT
Timing stopped at: 41.63 9.898 51.86
> gc()
         used (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells 375829 20.1   10000000   534.1    1095554    58.6
Vcells 805118  6.2 2581033758 19691.8 2685229083 20486.7
> 
------------

Your diagnosis seems correct to me.
However this just points again to the issue that bind.c's NewExtractNames()
is unnecessarily inefficient.... as you have noticed in PR#17284
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17284 .

I won't time/stamina to address this myself currently.
Comment 2 Martin Maechler 2017-07-18 16:54:46 UTC
I was wrong.  This is not fixed by your 2nd patch (
https://bugs.r-project.org/bugzilla/attachment.cgi?id=2265)
for PR # 17284 
(https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17284).

I'm still committing that now, so we can work to fix this bug, using the newer code.
Comment 3 Suharto Anggono 2017-07-22 00:18:47 UTC
Created attachment 2275 [details]
Against R devel 72947, bind.c

Not quite right error message from function 'SET_STRING_ELT' in memory.c, like "attempt to set index 18446744071562067968/2147483649 in SET_STRING_ELT" in the example, can be fixed by using an appropriate printf format specifier.
Comment 4 Suharto Anggono 2017-07-22 01:30:28 UTC
Created attachment 2276 [details]
Against R devel r72947, bind.c, for nameData->seqno overflow

In function 'NewExtractNames' in bind.c, another thing that potentially overflows is nameData->seqno. An example that triggers it:
res <- c(a = list(rep(c(b=raw(1)), 2^31-2), raw(2)), recursive=TRUE)

Result (36 GB of memory was used):
Error: could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

The example was successful when (2^31-2) was replaced by (2^31-3).


Casting 'seqno' to double doesn't result in loss of precision if seqno <= R_XLEN_T_MAX.
Alternatively, an appropriate printf format specifier can be used.
Comment 5 Suharto Anggono 2017-07-22 06:36:43 UTC
(In reply to Suharto Anggono from comment #4)
> Casting 'seqno' to double doesn't result in loss of precision if seqno <=
> R_XLEN_T_MAX.
> Alternatively, an appropriate printf format specifier can be used.

I see that function 'VectorIndex' in printutils.c uses %ld. Using %ld in function 'NewName' in bind.c doesn't give the desired result when 'seqno' is 2^31 or more for 64-bit R on Windows. On Windows, long is 32-bit.
Comment 6 Martin Maechler 2017-07-22 17:51:02 UTC
(In reply to Suharto Anggono from comment #5)
> (In reply to Suharto Anggono from comment #4)
> > Casting 'seqno' to double doesn't result in loss of precision if seqno <=
> > R_XLEN_T_MAX.
> > Alternatively, an appropriate printf format specifier can be used.
> 
> I see that function 'VectorIndex' in printutils.c uses %ld. Using %ld in
> function 'NewName' in bind.c doesn't give the desired result when 'seqno' is
> 2^31 or more for 64-bit R on Windows. On Windows, long is 32-bit.

printf() specifiers:  Yes, that's a bit of an unfortunate state I think.
As we  require  C99 nowadays for building R, I would like to use portable C99 types and PRI* format specifiers such as PRId64 from the C99  inttypes.h ,
see e.g.,
    https://en.wikipedia.org/wiki/Printf_format_string

Alternatively, but I think not more portably, we could use long long and %ll
which should be 64-bit also on 32-windows. Currently I think that finding consistent and future proof (and still nice to read) 
printf format specifiers in the R sources is something that should be addressed in a wider context.
Comment 7 Martin Maechler 2017-07-25 15:29:14 UTC
Closing the bug -- 

finding portable printf() format strings needs a new PR with its own subject.
--

Just some "final notes" here, before I forget it:
Actually, we configure check for working PRI macros,

checking for inttypes.h... yes
checking for unsigned long long int... yes
checking for inttypes.h... (cached) yes
checking whether the inttypes.h PRIxNN macros are broken... no
.................

Now if we made this (working PRIxNN macros) a strict requirement we could use them here and in quite a few more places.

Now checked: We have this check for broken PRIxNN since 2005 (!)
in m4/gettext.m4  where the comment says

# Define PRI_MACROS_BROKEN if <inttypes.h> exists and defines the PRI*
# macros to non-string values.  This is the case on AIX 4.3.3.

... so it may be time now to be able to require C compiler/library setup where
<inttypes.h>  correctly implements the PRInXX macros (as by the C99 standard).