Bug 16895 - Patch to 'table': not calling 'factor' if levels is not changed
Summary: Patch to 'table': not calling 'factor' if levels is not changed
Status: ASSIGNED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist (show other bugs)
Version: R-devel (trunk)
Hardware: All All
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-05-07 15:34 UTC by Suharto Anggono
Modified: 2016-05-31 21:10 UTC (History)
1 user (show)

See Also:


Attachments
In 'table' and 'addNA', call 'factor' only if levels is changed (1.89 KB, text/plain)
2016-05-07 15:34 UTC, Suharto Anggono
Details
In 'table' and 'addNA', call 'factor' only if levels is changed (1.89 KB, patch)
2016-05-07 15:37 UTC, Suharto Anggono
Details | Diff
In 'table', call 'factor' only if levels is changed (1.03 KB, patch)
2016-05-07 15:57 UTC, Suharto Anggono
Details | Diff
In 'addNA', not call 'factor' if there is NA in levels(x) and no NA in 'x' (422 bytes, patch)
2016-05-29 09:56 UTC, Suharto Anggono
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Suharto Anggono 2016-05-07 15:34:53 UTC
Created attachment 2083 [details]
In 'table' and 'addNA', call 'factor' only if levels is changed

In function 'table' in R before version 2.8.0, in default case where 'exclude' is not specified, inputs that are already factors are left as is. It is not so in current R. Function 'factor' is applied.

Calling function 'factor' takes time. This patch attempts to avoid it if levels will not changed after 'factor' is applied.
Comment 1 Suharto Anggono 2016-05-07 15:37:30 UTC
Created attachment 2084 [details]
In 'table' and 'addNA', call 'factor' only if levels is changed
Comment 2 Suharto Anggono 2016-05-07 15:57:55 UTC
Created attachment 2085 [details]
In 'table', call 'factor' only if levels is changed

Patched 'addNA' by my previous patch gives different result from current 'addNA' when there is NA in 'x' and also in levels(x). So, now I don't propose it.
Comment 3 Suharto Anggono 2016-05-29 09:56:42 UTC
Created attachment 2097 [details]
In 'addNA', not call 'factor' if there is NA in levels(x) and no NA in 'x'
Comment 4 Martin Maechler 2016-05-30 12:30:08 UTC
(In reply to Suharto Anggono from comment #3)
> Created attachment 2097 [details]
> In 'addNA', not call 'factor' if there is NA in levels(x) and no NA in 'x'

Sounds reasonable and innocuous... and easily checked the speed improvement. I've committed it (to R-devel only) 
where now

ff <- gl(50,20); identical(addNA(ff), addNA(addNA(ff))); (mb <- microbenchmark(addNA(ff), addNA(addNA(ff))))
[1] TRUE
Unit: microseconds
             expr    min      lq     mean  median      uq     max neval cld
        addNA(ff) 34.265 35.8410 41.83388 37.4735 40.1605  95.200   100  a 
 addNA(addNA(ff)) 41.365 43.4155 52.41979 45.4365 55.8495 112.265   100   b
> 

and before, the double call also cost "double time".
Comment 5 Suharto Anggono 2016-05-31 15:38:17 UTC
I note here what I've just found.
If 'x' is a factor, factor(x, levels = levels(x), exclude = NULL) is different from 'x' if 'x' has duplicated levels and any position where duplicated(levels(x)) is TRUE (not the first appearance) is used. 
So, patches here change 'table'/'addNA' behavior on such a factor.
Comment 6 Martin Maechler 2016-05-31 21:10:49 UTC
(In reply to Suharto Anggono from comment #5)
> I note here what I've just found.
> If 'x' is a factor, factor(x, levels = levels(x), exclude = NULL) is
> different from 'x' if 'x' has duplicated levels and any position where
> duplicated(levels(x)) is TRUE (not the first appearance) is used. 
> So, patches here change 'table'/'addNA' behavior on such a factor.

interesting.. however (in another thread) we've recently mentioned that
factors with duplicated levels should be considered invalid.

They can always be constructed by diligent programmeRs, if necessary even via .Call(), but I think we are allowed to signal errors wherever we encounter such beasts.

Here is a relevant message from a whole thread on the topic of duplcated factor levels, and about "R core has decided .. to abolish factors with duplicated levels .. for R 2.10.0":
https://stat.ethz.ch/pipermail/r-devel/2009-May/053269.html