Bug 16924 - Wishlist: reduce additional time taken by 'table' over 'tabulate'
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist
Version: R 3.3.0
Hardware: ix86 (32-bit) Windows 32-bit
Importance: P5 enhancement
Assignee: R-core
Reported: 2016-05-27 16:40 UTC by Suharto Anggono
Modified: 2016-09-12 00:14 UTC (History)

Attachments
'table': 'bin': start from 1, NA not removed (592 bytes, patch)
2016-05-27 16:42 UTC, Suharto Anggono
'table': 'bin': start from argument 1, NA not removed (1.18 KB, patch)
2016-05-27 16:52 UTC, Suharto Anggono
Against R devel r71154, 'table': 'bin': start from argument 1, NA not removed (1.16 KB, patch)
2016-08-26 16:18 UTC, Suharto Anggono
Timing script, 1 factor argument, default (4.72 KB, text/plain)
2016-09-12 00:14 UTC, Suharto Anggono

Description Suharto Anggono 2016-05-27 16:40:15 UTC
The workhorse of function 'table' in R is function 'tabulate', which is fast. Could the additional time that 'table' takes on top of 'tabulate' be reduced?
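
A minimal sketch of the relationship, assuming a single factor argument and the default settings of 'table'. The variable names are illustrative only, and the timings depend on the machine, so no expected timing is shown:

```r
set.seed(1)
f <- factor(sample(c(1:9, NA), 1e5, replace = TRUE), levels = 1:9)

# 'tabulate' counts the integer codes of the factor directly.
counts_tabulate <- tabulate(f, nbins = nlevels(f))

# 'table' returns the same counts, wrapped with dimnames and class "table".
counts_table <- as.integer(table(f))

stopifnot(identical(counts_tabulate, counts_table))

# Compare the cost of repeated calls; 'table' does extra work around 'tabulate'.
print(system.time(for (i in 1:20) tabulate(f, nbins = nlevels(f))))
print(system.time(for (i in 1:20) table(f)))
```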

Another proposal for speeding up 'table' is https://stat.ethz.ch/pipermail/r-devel/2013-August/067105.html, followed up by
https://stat.ethz.ch/pipermail/r-devel/2013-September/067509.html

Bug 16895 deals with an issue that arises from a change made to the code of function 'table' in R 2.8.0.

Aside from what is in Bug 16895, I've found room for speed improvement in the code of function 'table' in R.
- Because 'tabulate' ignores NA (as documented), it is not necessary to remove NA before applying 'tabulate'.
- In the calculation of 'bin', instead of starting from 0 and finishing by adding 1, 'bin' could start directly from 1, like 'group' in the code of function 'tapply' before R 3.3.0.
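
The two points above can be illustrated with a simplified sketch for two factor arguments. This is not the actual source of 'table'; the names 'ia', 'ib', 'bin0' and 'bin1' are hypothetical, and the zero-based variant only approximates the existing approach:

```r
a <- factor(c("x", "y", NA, "x", "y"))
b <- factor(c("u", "u", "v", "v", NA))
ia <- as.integer(a)   # integer codes; NA stays NA
ib <- as.integer(b)
na <- nlevels(a)
nb <- nlevels(b)

# Zero-based accumulation, explicit NA removal, then a shift by one.
bin0 <- (ia - 1L) + na * (ib - 1L)
bin0 <- bin0[!is.na(bin0)]
if (length(bin0)) bin0 <- bin0 + 1L
counts_old <- tabulate(bin0, na * nb)

# One-based accumulation starting from the first argument; NA is left in
# place because 'tabulate' silently ignores it.
bin1 <- ia + na * (ib - 1L)
counts_new <- tabulate(bin1, na * nb)

stopifnot(identical(counts_old, counts_new))
```

The second variant skips one subsetting pass and one vector addition over the full-length 'bin', which is where the proposed saving would come from.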
Comment 1 Suharto Anggono 2016-05-27 16:42:46 UTC
Created attachment 2095
'table': 'bin': start from 1, NA not removed
Comment 2 Suharto Anggono 2016-05-27 16:52:33 UTC
Created attachment 2096
'table': 'bin': start from argument 1, NA not removed

The calculation of 'bin' here mirrors that of 'group' in the code of function 'tapply' in R 3.3.0.
Comment 3 Suharto Anggono 2016-08-26 16:18:13 UTC
Created attachment 2143
Against R devel r71154, 'table': 'bin': start from argument 1, NA not removed
Comment 4 Suharto Anggono 2016-09-12 00:14:02 UTC
Created attachment 2150
Timing script, 1 factor argument, default

The setup is approximately as in https://stat.ethz.ch/pipermail/r-devel/2013-August/067105.html .

Result:

R> set.seed(123)
R> f <- factor(sample(c(1:9, NA), 1e6, replace=TRUE), 1:9)
R> system.time(table(f))
   user  system elapsed
   0.16    0.05    0.21
R> system.time(table(f))
   user  system elapsed
   0.19    0.00    0.18
R> system.time(table.new2(f))
   user  system elapsed
   0.00    0.00    0.01
R> system.time(table.new2(f))
   user  system elapsed
      0       0       0
R> identical(table(f), table.new2(f))
[1] TRUE
R> f <- factor(sample(1e6), seq(1e6))
R> system.time(table(f))
   user  system elapsed
   0.41    0.04    0.44
R> system.time(table(f))
   user  system elapsed
   0.16    0.03    0.20
R> system.time(table.new2(f))
   user  system elapsed
   0.06    0.03    0.11
R> system.time(table.new2(f))
   user  system elapsed
   0.11    0.00    0.11
R> identical(table(f), table.new2(f))
[1] TRUE
R> sessionInfo()
R Under development (unstable) (2016-09-10 r71232)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 2

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] compiler  stats     graphics  grDevices utils     datasets  methods
[8] base