Bug 14121

Summary: reshape() makes R run out of memory
Product: R Reporter: Jitterbug compatibility account <jitterbug-import>
Component: Misc Assignee: Jitterbug compatibility account <jitterbug-import>
Status: CLOSED FIXED    
Severity: normal    
Priority: P5    
Version: old   
Hardware: All   
OS: Linux-Ubuntu   

Description Jitterbug compatibility account 2009-12-09 13:29:43 UTC
From: abelikoff@gmail.com
Full_Name: Alexander L. Belikoff
Version: 2.8.1
OS: Ubuntu 9.04 (x86_64)
Submission from: (NULL) (67.244.71.200)


I'm trying to reshape the following data frame:

ID                     DATE1             DATE2      VALUE_TYPE        VALUE
'abcd1233'         2009-11-12        2009-12-23     'TYPE1'           123.45
...

VALUE_TYPE is a string column stored as a factor with only 2 levels (say TYPE1 and TYPE2).
I need to transform it into the following data frame ("wide" transpose) based on
common ID and DATEs:

ID                     DATE1             DATE2      VALUE.TYPE1   VALUE.TYPE2
'abcd1233'         2009-11-12       2009-12-23      123.45        NA
...

Using stock reshape() as follows:

    tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"),
timevar = "VALUE_TYPE");
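
For reference, a minimal sketch of the toy case that works. The three rows are made up for illustration (only the first ID and its dates come from the report); the reshape() call is the one from the report.

```r
# Toy data in the layout described in the report; values are illustrative only.
tbl <- data.frame(
  ID         = c("abcd1233", "abcd1233", "efgh4567"),
  DATE1      = as.Date(c("2009-11-12", "2009-11-12", "2009-11-13")),
  DATE2      = as.Date(c("2009-12-23", "2009-12-23", "2009-12-24")),
  VALUE_TYPE = factor(c("TYPE1", "TYPE2", "TYPE1")),
  VALUE      = c(123.45, 67.8, 9.1)
)

# One output row per (ID, DATE1, DATE2); one VALUE column per VALUE_TYPE level.
tbl2 <- reshape(tbl, direction = "wide",
                idvar = c("ID", "DATE1", "DATE2"),
                timevar = "VALUE_TYPE")

names(tbl2)  # ID, DATE1, DATE2, VALUE.TYPE1, VALUE.TYPE2
```

The second ID has no TYPE2 row, so its VALUE.TYPE2 entry comes out as NA.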

On a toy data frame this works fine. On a real one with 4.7 million entries
(although about 70% of VALUEs are NA) it runs out of memory:

    Error: cannot allocate vector of size 4.8 Gb

When the real data frame is loaded the R process takes about 200Mb of virtual
memory. The machine has 4 Gb of RAM.

I've posted a .Rdata file with the data frame in question at
http://belikoff.net/stuff/other/reshape_test.Rdata.gz


P.S. Just checked R 2.10.0 on an Intel PC with 2 Gb RAM running Windows XP Pro
(32-bit):

> tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"),
timevar = "VALUE_TYPE");
Error: cannot allocate vector of size 53.9 Mb
In addition: Warning messages:
1: In format.POSIXlt(as.POSIXlt(x), ...) :
  Reached total allocation of 1535Mb: see help(memory.size)
2: In format.POSIXlt(as.POSIXlt(x), ...) :
  Reached total allocation of 1535Mb: see help(memory.size)
3: In format.POSIXlt(as.POSIXlt(x), ...) :
  Reached total allocation of 1535Mb: see help(memory.size)
4: In format.POSIXlt(as.POSIXlt(x), ...) :
  Reached total allocation of 1535Mb: see help(memory.size)
5: In format.POSIXlt(as.POSIXlt(x), ...) :
  Reached total allocation of 1535Mb: see help(memory.size)
6: In format.POSIXlt(as.POSIXlt(x), ...) :
  Reached total allocation of 1535Mb: see help(memory.size)
>

Comment 1 Jitterbug compatibility account 2009-12-10 02:45:00 UTC
From: Peter Dalgaard <p.dalgaard@biostat.ku.dk>
abelikoff@gmail.com wrote:
> Full_Name: Alexander L. Belikoff
> Version: 2.8.1
> OS: Ubuntu 9.04 (x86_64)
> Submission from: (NULL) (67.244.71.200)
> 
> 
> I'm trying to reshape the following data frame:
> 
> ID                     DATE1             DATE2      VALUE_TYPE        VALUE
> 'abcd1233'         2009-11-12        2009-12-23     'TYPE1'           123.45
> ...
> 
> VALUE_TYPE is a string and is a factor with only 2 values (say TYPE1 and TYPE2).
> I need to transform it into the following data frame ("wide" transpose) based on
> common ID and DATEs:
> 
> ID                     DATE1             DATE2      VALUE.TYPE1   VALUE.TYPE2
> 'abcd1233'         2009-11-12       2009-12-23      123.45        NA
> ...
> 
> Using stock reshape() as follows:
> 
>     tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"),
> timevar = "VALUE_TYPE");
> 
> On a toy data frame this works fine. On a real one with 4.7 million entries
> (although about 70% of VALUEs are NA) it runs out of memory:
> 
>     Error: cannot allocate vector of size 4.8 Gb
> 
> When the real data frame is loaded the R process takes about 200Mb of virtual
> memory. The machine has 4 Gb of RAM.
> 
> I've posted a .Rdata file with the data frame in question at
> http://belikoff.net/stuff/other/reshape_test.Rdata.gz
> 
> 
> P.S. Just checked R 2.10.0 using an Intel PC with 2Gb RAM running Xp Pro (32
> bit):
> 
>> tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"),
> timevar = "VALUE_TYPE");
> Error: cannot allocate vector of size 53.9 Mb
> In addition: Warning messages:
....

Yes. The culprit would seem to be interaction(), as in

 > x <- y <- z <- 1:999
 > i <- interaction(x,y,z, drop=TRUE)
Error: cannot allocate vector of size 3.7 Gb

which is happening due to the occurrence of three idvar variables. This 
works basically as interaction(x,y,z)[,drop=TRUE], i.e. it first creates 
a factor with 999^3 levels, and removes the empty levels afterward.
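
A scaled-down illustration of that level count, assuming n = 20 instead of 999 so that it fits comfortably in memory. The size estimate at the end assumes 4 bytes per element; that is an assumption, though it happens to agree with the 3.7 Gb in the error message.

```r
n <- 20
x <- y <- z <- seq_len(n)

# Without drop, every combination of levels is materialised.
full <- interaction(x, y, z)
nlevels(full)              # n^3 = 8000

# Only n combinations actually occur in the data.
used <- full[, drop = TRUE]
nlevels(used)              # 20

# Rough size of one vector of length 999^3, assuming 4 bytes per element.
gib <- 999^3 * 4 / 2^30    # about 3.71
```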

In the absence of a better interaction(), you might try making your own
single idvar as do.call("paste",tbl[,c("ID", "DATE1", "DATE2")]) or so.
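
A minimal sketch of that workaround on made-up data. The column name KEY and the "|" separator are assumptions for illustration (the separator should not occur in the data), and v.names is set explicitly so that the original id columns are carried through unchanged.

```r
# Toy data in the layout described in the report; values are illustrative only.
tbl <- data.frame(
  ID         = c("abcd1233", "abcd1233", "efgh4567"),
  DATE1      = as.Date(c("2009-11-12", "2009-11-12", "2009-11-13")),
  DATE2      = as.Date(c("2009-12-23", "2009-12-23", "2009-12-24")),
  VALUE_TYPE = factor(c("TYPE1", "TYPE2", "TYPE1")),
  VALUE      = c(123.45, 67.8, 9.1)
)

# Build one character key per row, so reshape() sees a single idvar
# and never needs to call interaction() on the three columns.
tbl$KEY <- do.call("paste", c(tbl[, c("ID", "DATE1", "DATE2")], sep = "|"))

wide <- reshape(tbl, direction = "wide",
                idvar = "KEY", timevar = "VALUE_TYPE",
                v.names = "VALUE")

# The helper key is no longer needed once the data is wide.
wide$KEY <- NULL
```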

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)              FAX: (+45) 35327907

Comment 2 Jitterbug compatibility account 2009-12-10 06:10:45 UTC
From: hadley wickham <h.wickham@gmail.com>
> Yes. The culprit would seem to be interaction(), as in
>
>> x <- y <- z <- 1:999
>> i <- interaction(x,y,z, drop=TRUE)
> Error: cannot allocate vector of size 3.7 Gb
>
> which is happening due to the occurrence of three idvar variables. This
> works basically as interaction(x,y,z)[,drop=TRUE], i.e. it first creates a
> factor with 999^3 levels, and removes the empty levels afterward.
>
> In the absence of a better interaction(), you might try making your own
> single idvar as do.call("paste",tbl[,c("ID", "DATE1", "DATE2")]) or so.

There's also ninteraction in the plyr package, which has been designed
to generate a unique integer for each combination (while maintaining
the original order of the data and any missing combinations) as
efficiently as possible.  It's much faster than interaction(..., drop
= T) and I hope it would be faster than paste since it works with
integers rather than strings.
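
To illustrate the general idea of an integer id per combination, here is a rough base R sketch. It is not plyr's implementation: combo_id is a hypothetical helper, it accumulates the key in doubles (so it assumes the product of the level counts stays below 2^53), and it lumps together all rows that contain an NA.

```r
# Hypothetical helper: one integer id per distinct row of `df`,
# numbered in order of first appearance.
combo_id <- function(df) {
  key <- rep(0, nrow(df))                  # double, to delay overflow
  for (col in df) {
    f <- factor(col)
    # Mixed-radix encoding: shift the key by this column's level count,
    # then add the zero-based level code.
    key <- key * nlevels(f) + (as.integer(f) - 1)
  }
  match(key, unique(key))
}

df <- data.frame(a = c("x", "y", "x", "y"), b = c(1, 1, 1, 2))
ids <- combo_id(df)   # 1 2 1 3
```

The resulting vector could then serve as a single idvar for reshape(), in the same way as a pasted key.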

Hadley

-- 
http://had.co.nz/

Comment 3 Jitterbug compatibility account 2010-01-12 22:21:00 UTC
NOTES:
is using a lot of memory a bug?
uses less memory in 2.11.0

Comment 4 Jitterbug compatibility account 2010-01-12 22:21:46 UTC
Audit (from Jitterbug):
Thu Dec 17 07:03:51 2009	ripley	changed notes
Tue Jan 12 16:21:45 2010	ripley	changed notes
Tue Jan 12 16:21:46 2010	ripley	moved from incoming to Misc-fixed