Bug 16936 - 'table' drops "NaN" factor level by default, not as documented
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Analyses
Version: R 3.3.0
Hardware: ix86 (32-bit) Windows 32-bit
Importance: P5 minor
Assignee: R-core
 
Reported: 2016-06-04 14:06 UTC by Suharto Anggono
Modified: 2016-07-31 06:18 UTC

Attachments
table: for exclude on factor, directly change integer codes (2.40 KB, patch)
2016-07-23 12:15 UTC, Suharto Anggono

Description Suharto Anggono 2016-06-04 14:06:51 UTC
The documentation on 'table', in "Details" section, says:
Only when 'exclude' is specified and non-NULL (i.e., not by default), will 'table' potentially drop levels of factor arguments.

In fact, when 'exclude' is unspecified, the "NaN" factor level disappears from the result of 'table'. An example, as in https://stat.ethz.ch/pipermail/r-devel/2011-January/059652.html :

R> table(factor(c("NA",NA,"NcN","NbN", "NaN")))

 NA NbN NcN
  1   1   1
R> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 2

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_3.3.0

This has happened since the code of function 'table' was changed in R 2.8.0.
In the code of function 'table' in R 2.7.2, the condition that leaves a factor argument unchanged is missing(exclude).
In the code of function 'table' in R 3.3.0, the condition that leaves a factor argument unchanged is any(is.na(levels(a))), where 'a' is the argument.
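
A minimal sketch of one plausible mechanism, stated as an assumption rather than a reading of the R 3.3.0 source: if the levels are compared with the default exclude = c(NA, NaN) through %in%, both sides are coerced to character. NaN then becomes the string "NaN" and matches the level of that name, while NA stays a missing value and does not match the string "NA". The names 'lev' and 'default_exclude' are only for illustration.

```r
# Levels of the factor from the example above.
lev <- c("NA", "NaN", "NbN", "NcN")

# Assumed default value of 'exclude' when useNA = "no".
default_exclude <- c(NA, NaN)

# Coercion to character keeps NA missing but turns NaN into the string "NaN".
as.character(default_exclude)

# %in% (i.e. match) compares on the common type, character in this case.
dropped <- lev %in% default_exclude
lev[!dropped]   # the "NaN" level is gone, the "NA" level stays
```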

On the other hand, as mentioned in https://stat.ethz.ch/pipermail/r-help/2009-April/389286.html , checking any(is.na(levels(a))) causes another strange, undocumented 'table' behavior: 'exclude' does not work for factor arguments with NA as a level.

R> x <- factor(c(rep(c("A","B","C"), 2), NA), exclude = NULL)
R> x
[1] A    B    C    A    B    C    <NA>
Levels: A B C <NA>
R> table(x, exclude = "B")
x
   A    B    C <NA>
   2    2    2    1

The tests in function 'test_table' in https://stat.ethz.ch/pipermail/r-devel/2013-August/067105.html that fail for 'table' are also for factors with NA as a level.
Comment 1 Suharto Anggono 2016-06-04 16:18:48 UTC
(In reply to Suharto Anggono from comment #0)
> In the code of function 'table' in R 3.3.0, the condition that leaves a factor
> argument unchanged is any(is.na(levels(a))), where 'a' is the argument.

My guess is that this treatment was chosen because applying factor(exclude = NA) drops the NA factor level, while applying factor(exclude = NULL) changes NA values to be associated with the NA factor level. Both side effects were unintended.
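
The two side effects can be reproduced directly with 'factor', independently of 'table'. This is only an illustration with made-up data; the variable names are not from the 'table' source.

```r
# Case 1: NA is a level. Re-applying factor() with exclude = NA drops it.
x <- factor(c("A", "B", NA), exclude = NULL)
levels(x)                        # "A" "B" NA
levels(factor(x, exclude = NA))  # "A" "B"

# Case 2: NA is only a value, not a level. With exclude = NULL it becomes one.
y <- factor(c("A", "B", NA))
z <- factor(y, exclude = NULL)
levels(z)                        # "A" "B" NA
as.integer(z)                    # 1 2 3: the former NA value now has a code
```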
Comment 2 Suharto Anggono 2016-07-23 12:15:43 UTC
Created attachment 2133 [details]
table: for exclude on factor, directly change integer codes

This touches the part marked with "Don't touch this!".

Because this leaves factor arguments as is by default, as in R 2.7.2, this resolves the problem in Description of Bug 16895.

For exclude on a factor argument, because this operates on the integer codes, NA as a level is not a problem. No special treatment is needed for a factor with NA as a level, and 'exclude' works in that case.

This doesn't apply 'addNA' if NA is already a level.
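
The idea of working on the integer codes can be sketched as follows. This is not the attached patch; 'exclude_by_codes' is a hypothetical helper written only to illustrate the approach.

```r
# Hypothetical helper (not the attached patch): drop levels by remapping codes.
exclude_by_codes <- function(f, exclude) {
  stopifnot(is.factor(f))
  ll <- levels(f)
  keep <- !(ll %in% exclude)   # %in% never returns NA, even for an NA level
  map <- cumsum(keep)          # new code for each kept level
  map[!keep] <- NA             # excluded levels map to a missing code
  structure(map[as.integer(f)], levels = ll[keep], class = class(f))
}

# The factor from the Description, which has NA as a level.
x <- factor(c(rep(c("A", "B", "C"), 2), NA), exclude = NULL)
y <- exclude_by_codes(x, "B")
levels(y)       # "A" "C" NA
as.integer(y)   # 1 NA 2 1 NA 2 3
```

Because the levels are never passed through 'factor' again, the NA level keeps its own code and the values of the excluded level "B" simply become missing.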
Comment 3 Martin Maechler 2016-07-26 10:24:56 UTC
I'm starting to investigate the problem, and notably your attachment;
thank you, Suharto !

This *is* a somewhat delicate issue -- notably as you mention the 
   "Don't touch this!"
being touched.
 
(It may be good if other readers of this could get involved too and we could  start looking into it partly by e-mail ..)
Comment 4 Martin Maechler 2016-07-27 09:33:24 UTC
(In reply to Martin Maechler from comment #3)
> I'm starting to investigate the problem, and notably your attachment;
> thank you, Suharto !
> 
> This *is* a somewhat delicate issue -- notably as you mention the 
>    "Don't touch this!"
> being touched.
>  
> (It may be good if other readers of this could get involved too and we could
> start looking into it partly by e-mail ..)

No reaction yet.
I've tested the patch now ('make check-all'), i.e., including all the recommended packages and I have not seen any adverse effect.

Hence I'm committing the bug fix (your patch + regression test + news) to R-devel  -- for now. A backport to "R 3.3.1 patched" may make sense, if we don't hear of cases where the change has caused harm.
Comment 5 Suharto Anggono 2016-07-27 15:34:47 UTC
(In reply to Suharto Anggono from comment #2)
> Created attachment 2133 [details]
> table: for exclude on factor, directly change integer codes
> 

Cosmetic: Using variable name 'keep' in place of 'used' looks a little better to me.

> 
> This doesn't apply 'addNA' if NA is already a level.

For a factor with NA as a level, applying and not applying 'addNA' makes a difference if there is also an NA value.

Example:
x <- factor(c(1, 2, NA), exclude = NULL)
is.na(x)[2] <- TRUE
table(x, useNA = "always")

In the result, count of NA is 1.
If 'addNA' were applied, count of NA would be 2 in the result.
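
The difference can be seen on the integer codes without calling 'table' at all (a small illustration using the same 'x'):

```r
x <- factor(c(1, 2, NA), exclude = NULL)  # levels "1" "2" NA
is.na(x)[2] <- TRUE                       # second element becomes an NA value

as.integer(x)         # 1 NA 3: one NA value, one element at the NA level
as.integer(addNA(x))  # 1 3 3: 'addNA' moves the NA value to the NA level
```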
Comment 6 Martin Maechler 2016-07-29 13:18:08 UTC
(In reply to Suharto Anggono from comment #5)
> 
> Cosmetic: Using variable name 'keep' in place of 'used' looks a little
> better to me.

Yes, I agree;  but there are other "obvious" simplifications which I will do
and commit *before* addressing the problem below :


> > 
> > This doesn't apply 'addNA' if NA is already a level.
> 
> For a factor with NA as a level, applying and not applying 'addNA' makes a
> difference if there is also an NA value.
> 
> Example:
> x <- factor(c(1, 2, NA), exclude = NULL)
> is.na(x)[2] <- TRUE
> table(x, useNA = "always")
> 
> In the result, count of NA is 1.
> If 'addNA' were applied, count of NA would be 2 in the result.

and that would be correct.
I agree this is another bug.  I have now investigated and found that it was introduced when the new 'useNA' argument was added, i.e., in summer 2008.   I strongly believe this should be changed, even though it has been in R since R 2.8.0.
Comment 7 Martin Maechler 2016-07-30 09:05:30 UTC
(In reply to Martin Maechler from comment #6)
> (In reply to Suharto Anggono from comment #5)
 
[.............]

> I agree this is another bug.  I have now investigated and found that it was
> introduced when the new 'useNA' argument was added, i.e., in summer 2008.
> I strongly believe this should be changed, even though it has been in R
> since R 2.8.0.

Fix committed  (to R-devel, svn r71009, only, for now)
Comment 8 Suharto Anggono 2016-07-31 00:39:48 UTC
In function 'table' in R devel r71012, condition to apply 'addNA' is
useNA != "no" && (anyNA(a) || !anyNA(levels(a))) .
Using just
useNA != "no"
actually works. Function 'addNA' already takes care of everything. The code of function 'table' in R 3.3.1 does it that way for 'a' that is originally not a factor.
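
A quick check of the case that the extra condition guards against, namely a factor that has NA as a level but no NA value ('f' is just an example):

```r
f <- factor(c("A", NA), exclude = NULL)  # NA is a level, no NA value
anyNA(levels(f))        # TRUE
anyNA(f)                # FALSE
identical(addNA(f), f)  # 'addNA' leaves such a factor as it is
```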
Comment 9 Suharto Anggono 2016-07-31 06:18:52 UTC
(In reply to Suharto Anggono from comment #2)
> Created attachment 2133 [details]
> table: for exclude on factor, directly change integer codes
> 
> For exclude on a factor argument, because this operates on the integer
> codes, NA as a level is not a problem. No special treatment is needed for a
> factor with NA as a level, and 'exclude' works in that case.

For exclude on a factor argument, if factors that have both NA as a level and NA values are not a concern, or if it is OK for NA values to become associated with the NA factor level (with useNA = "no"), one may as well proceed with something like
    ll <- levels(a)
    a <- factor(a, levels = ll[!(ll %in% exclude)], exclude = NULL)
(not preceded by  a <- as.integer(a) ).
Something like that is in the code of function 'xtabs' in R 3.3.1.
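
A small run of that approach on made-up data, showing the side effect mentioned above. The value "B" stands in for 'exclude', and 'b' is only an illustrative name.

```r
a <- factor(c("A", "B", "C", NA), exclude = NULL)  # NA is a level
is.na(a)[1] <- TRUE                                # and there is an NA value
as.integer(a)   # NA 2 3 4

ll <- levels(a)
b <- factor(a, levels = ll[!(ll %in% "B")], exclude = NULL)
levels(b)       # "A" "C" NA
as.integer(b)   # 3 NA 2 3: the old NA value moved to the NA level,
                # and the excluded "B" became an NA value
```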