Bug 16895

Summary: Patch to 'table': not calling 'factor' if levels is not changed
Product: R Reporter: Suharto Anggono <suharto_anggono>
Component: WishlistAssignee: R-core <R-core>
Status: CLOSED FIXED    
Severity: enhancement CC: maechler
Priority: P5    
Version: R-devel (trunk)   
Hardware: All   
OS: All   
Attachments: In 'table' and 'addNA', call 'factor' only if levels is changed
In 'table' and 'addNA', call 'factor' only if levels is changed
In 'table', call 'factor' only if levels is changed
In 'addNA', not call 'factor' if there is NA in levels(x) and no NA in 'x'

Description Suharto Anggono 2016-05-07 15:34:53 UTC
Created attachment 2083 [details]
In 'table' and 'addNA', call 'factor' only if levels is changed

In function 'table' in R before version 2.8.0, in default case where 'exclude' is not specified, inputs that are already factors are left as is. It is not so in current R. Function 'factor' is applied.

Calling function 'factor' takes time. This patch attempts to avoid it if levels will not changed after 'factor' is applied.
Comment 1 Suharto Anggono 2016-05-07 15:37:30 UTC
Created attachment 2084 [details]
In 'table' and 'addNA', call 'factor' only if levels is changed
Comment 2 Suharto Anggono 2016-05-07 15:57:55 UTC
Created attachment 2085 [details]
In 'table', call 'factor' only if levels is changed

Patched 'addNA' by my previous patch gives different result from current 'addNA' when there is NA in 'x' and also in levels(x). So, now I don't propose it.
Comment 3 Suharto Anggono 2016-05-29 09:56:42 UTC
Created attachment 2097 [details]
In 'addNA', not call 'factor' if there is NA in levels(x) and no NA in 'x'
Comment 4 Martin Maechler 2016-05-30 12:30:08 UTC
(In reply to Suharto Anggono from comment #3)
> Created attachment 2097 [details]
> In 'addNA', not call 'factor' if there is NA in levels(x) and no NA in 'x'

Sounds reasonable and innocuous... and easily checked the speed improvement. I've committed it (to R-devel only) 
where now

ff <- gl(50,20); identical(addNA(ff), addNA(addNA(ff))); (mb <- microbenchmark(addNA(ff), addNA(addNA(ff))))
[1] TRUE
Unit: microseconds
             expr    min      lq     mean  median      uq     max neval cld
        addNA(ff) 34.265 35.8410 41.83388 37.4735 40.1605  95.200   100  a 
 addNA(addNA(ff)) 41.365 43.4155 52.41979 45.4365 55.8495 112.265   100   b
> 

and before, the double call also cost "double time".
Comment 5 Suharto Anggono 2016-05-31 15:38:17 UTC
I note here what I've just found.
If 'x' is a factor, factor(x, levels = levels(x), exclude = NULL) is different from 'x' if 'x' has duplicated levels and any position where duplicated(levels(x)) is TRUE (not the first appearance) is used. 
So, patches here change 'table'/'addNA' behavior on such a factor.
Comment 6 Martin Maechler 2016-05-31 21:10:49 UTC
(In reply to Suharto Anggono from comment #5)
> I note here what I've just found.
> If 'x' is a factor, factor(x, levels = levels(x), exclude = NULL) is
> different from 'x' if 'x' has duplicated levels and any position where
> duplicated(levels(x)) is TRUE (not the first appearance) is used. 
> So, patches here change 'table'/'addNA' behavior on such a factor.

interesting.. however (in another thread) we've recently mentioned that
factors with duplicated levels should be considered invalid.

They can always be constructed by diligent programmeRs, if necessary even via .Call(), but I think we are allowed to signal errors wherever we encounter such beasts.

Here is a relevant message from a whole thread on the topic of duplcated factor levels, and about "R core has decided .. to abolish factors with duplicated levels .. for R 2.10.0":
https://stat.ethz.ch/pipermail/r-devel/2009-May/053269.html
Comment 7 Martin Maechler 2018-08-02 19:18:10 UTC
I had implemented a version of the addNA() proposal, back then.

For table(), we have changed it too much in the mean time, and have possibly solved the problem // got the speedup in related ways.

We'd need new repr.ex's and patches if at all .. -> closing this bug report as "fixed" (we had done 1/2 of the proposal).