Bug 17118

Summary: Wishlist: Make sapply(x, f) not much slower for named 'x' and long f(x[[i]])
Product: R Reporter: Suharto Anggono <suharto_anggono>
Component: Wishlist Assignee: R-core <R-core>
Status: CLOSED FIXED    
Severity: enhancement CC: maechler
Priority: P5    
Version: R 3.3.*   
Hardware: ix86 (32-bit)   
OS: Windows 32-bit   
URL: http://stackoverflow.com/questions/12188509/cleaning-inf-values-from-an-r-dataframe
Attachments: simplify2array: use unlist(use.names=FALSE) for common.len>1

Description Suharto Anggono 2016-07-12 16:41:05 UTC
Created attachment 2126
simplify2array: use unlist(use.names=FALSE) for common.len>1

Timings in the answer by mnel at http://stackoverflow.com/questions/12188509/cleaning-inf-values-from-an-r-dataframe demonstrate that using 'sapply' may slow things down noticeably.

A modified example:

R> dat <- list(a = rep(c(1,Inf), 1e5), b = rep(c(Inf,2), 1e5),
R+ c = rep(c('a','b'), 1e5), d = rep(c(1,Inf), 1e5),
R+ e = rep(c(Inf,2), 1e5))
R> system.time(sapply(dat, is.infinite))
   user  system elapsed
   3.27    0.05    3.34
R> system.time(sapply(dat, is.infinite))
   user  system elapsed
   2.05    0.01    2.07
R> system.time(lapply(dat, is.infinite))
   user  system elapsed
   0.01    0.00    0.02
R> system.time(lapply(dat, is.infinite))
   user  system elapsed
   0.03    0.00    0.03
R> system.time(do.call(cbind, lapply(dat, is.infinite)))
   user  system elapsed
   0.04    0.00    0.03
R> system.time(do.call(cbind, lapply(dat, is.infinite)))
   user  system elapsed
   0.05    0.00    0.03
R> system.time(vapply(dat, is.infinite, logical(length(dat[[1]]))))
   user  system elapsed
   0.03    0.00    0.03
R> system.time(vapply(dat, is.infinite, logical(length(dat[[1]]))))
   user  system elapsed
   0.03    0.00    0.03
R> dat2 <- dat; names(dat2) <- NULL
R> system.time(sapply(dat2, is.infinite))
   user  system elapsed
   0.01    0.00    0.03
R> system.time(sapply(dat2, is.infinite))
   user  system elapsed
   0.04    0.00    0.03

When applied to 'dat2', which has no names, 'sapply' is much faster.


R> system.time(unlist(lapply(dat, is.infinite),
R+ recursive = FALSE))
   user  system elapsed
   2.85    0.00    2.95
R> system.time(unlist(lapply(dat, is.infinite),
R+ recursive = FALSE))
   user  system elapsed
   2.26    0.00    2.31
R> system.time(unlist(lapply(dat, is.infinite),
R+ recursive = FALSE, use.names = FALSE))
   user  system elapsed
   0.03    0.00    0.03
R> system.time(unlist(lapply(dat, is.infinite),
R+ recursive = FALSE, use.names = FALSE))
   user  system elapsed
   0.03    0.00    0.03

The timings above suggest that the time is spent in 'unlist' when it constructs names.
'sapply' calls 'simplify2array', and the code of 'simplify2array' uses 'unlist'.
unlist(use.names = FALSE) is much faster.
In 'simplify2array', for common.len > 1L, unlist(use.names = FALSE) could be used instead. If simplification is done, function 'array' is applied afterwards, and the names in the 'unlist' result are not used.
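
A minimal sketch of why dropping the names should be safe in the common.len > 1L case. It is not the actual body of 'simplify2array', only an imitation of its array() step on a small list; the variable names 'answer', 'slow' and 'fast' are chosen here for illustration.

```r
# Small named list; every element has length 3, so common.len > 1.
answer <- lapply(list(a = c(1, Inf, 3), b = c(Inf, 2, Inf)), is.infinite)
common.len <- unique(lengths(answer))

# Imitation of the array() step: dimnames come from names(answer) and
# names(answer[[1L]]), not from the names that unlist() constructs.
d  <- c(common.len, length(answer))
dn <- list(names(answer[[1L]]), names(answer))

slow <- array(unlist(answer, recursive = FALSE), dim = d, dimnames = dn)
fast <- array(unlist(answer, recursive = FALSE, use.names = FALSE),
              dim = d, dimnames = dn)

# Both routes give the same matrix; only the cost of building names differs.
stopifnot(identical(slow, fast))
```

On a list as large as 'dat' above, the first route pays for constructing 1e6 element names that array() then discards.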


R> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 2

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.3.1

Comment 1 Martin Maechler 2016-07-14 09:45:52 UTC
Thank you, Suharto; this is much appreciated!

I've checked the proposal, including against all recommended packages, and I can't think of a case where it fails.

Committed to R-devel (only).