Bug 16580 - format.data.frame() looses a column named "stringsAsFactors"
Summary: format.data.frame() looses a column named "stringsAsFactors"
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Low-level (show other bugs)
Version: R-devel (trunk)
Hardware: All All
: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-10-27 06:34 UTC by Hervé Pagès
Modified: 2016-01-07 16:24 UTC (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Hervé Pagès 2015-10-27 06:34:14 UTC
Create a data frame with a column named "stringsAsFactors":

  > df <- as.data.frame(list(a=1:4, b=letters[1:4], stringsAsFactors=FALSE))

'df' is a data frame with 3 columns, the 3rd column being the "stringsAsFactors" column:

  > dim(df)
  [1] 4 3

  > names(df)
  [1] "a"                "b"                "stringsAsFactors"

  > df$stringsAsFactors
  [1] FALSE FALSE FALSE FALSE

However, print.data.frame() doesn't display the 3rd column:

  > df
    a b
  1 1 a
  2 2 b
  3 3 c
  4 4 d

This is because print.data.frame() uses format.data.frame() internally and
the latter is broken if the data frame contains a "stringsAsFactors" column:

  > format.data.frame(df)
    a b
  1 1 a
  2 2 b
  3 3 c
  4 4 d

  > dim(format.data.frame(df))
  [1] 4 2

  > names(format.data.frame(df))
  [1] "a" "b"

This bug is present in R 3.0, 3.1, 3.2, and devel.

The culprit is line 250 in trunk/src/library/base/R/format.R:

  m <- match(c("row.names", "check.rows", "check.names", ""), cn, 0L)

which should be:

  m <- match(c("row.names", "check.rows", "check.names", "stringsAsFactors", ""), cn, 0L)

Note that, except for the colname truncating business, lines 249-261 are turning list 'rval' into a data frame and thus are basically doing what as.data.frame.list() does (which BTW uses the same "colname protection trick", see line 177 in trunk/src/library/base/R/dataframe.R). So it looks like there might be an opportunity for code re-use e.g. maybe format.data.frame() could use as.data.frame.list()?

Thanks,
H.

PS: Michael if you see this, this is what breaks show() on a DataFrame in S4Vectors:

  > DataFrame(a=1:4, b=letters[1:4], stringsAsFactors=FALSE)
  DataFrame with 4 rows and 3 columns
  Error in matrix(unlist(lapply(object, function(x) { : 
    length of 'dimnames' [2] not equal to array extent
Comment 1 Martin Maechler 2015-10-27 10:50:18 UTC
Thank you, Herve for the valid bug report

*And* indeed, re-using  as.data.frame.list() seems a good idea to me.

I'm adding an optional 'cut.names = FALSE' argument to that, so
 format.data.frame() will call
 
   as.data.frame.list(rval, cut.names = TRUE)
Comment 2 Martin Maechler 2015-10-29 16:15:56 UTC
I did follow Herve's proposal of  format.data.frame()
in the end just calling  as.data.frame.list(...).

But that call needed extra arguments, and *both*
as.data.frame.list() and the underlying data.frame() needed to gain new options.

Hence this change is only to R-devel for a while now, at least.

------------------------------------------------------------------------
r69582 | maechler | 2015-10-29 17:12:54 +0100 (Thu, 29 Oct 2015) | 2 lines
Changed paths:
   M doc/NEWS.Rd
   M src/library/base/R/dataframe.R
   M src/library/base/R/format.R
   M src/library/base/man/as.data.frame.Rd
   M src/library/base/man/data.frame.Rd
   M tests/reg-tests-1c.R

PR#16580: data frames with column name "StringsAsFactors" now format and print correctly;
  data.frame() gains argument `fix.empty.names` and as.data.frame.list() gets new `cut.names`, `col.names` and `fix.empty.names`.
------------------------------------------------------------------------
Comment 3 Hervé Pagès 2015-10-29 16:46:58 UTC
These changes sound good. Just a minor cosmetic thing: I only had a quick look at the diff and saw StringsAsFactors instead of stringsAsFactors in the NEWS.Rd file. Thanks!
Comment 4 Mike Jiang 2015-11-12 20:04:17 UTC
I think there is a bug introduced by as.data.frame.list. Here is reproducible example:

  > library(flowCore)
  > data(GvHD)
  > frame <- GvHD[[1]]
  > dd <- pData(parameters(frame))
  > dd
       name              desc range minRange maxRange
  $P1 FSC-H        FSC-Height  1024        0     1023
  ...
  $P6 FL2-A              <NA>  1024        0     1023
  $P7 FL4-H          CD33 APC  1024        1    10000
  ...

Note that the 6th row has NA 'desc'
Now we subset on the 'desc' column and it's supposed to return a legitimate data.frame as below:

  > dd <- dd[,2,drop = F]
  > dd
                   desc
  $P1        FSC-Height
  ...
  $P6              <NA>
  $P7          CD33 APC
  ...

However with the latest changes, we see the following error
  > dd
  Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =    TRUE,  : 
    row names contain missing values

I've looked at the code and it throws at the following call:

   x <- do.call(data.frame, c(x, list(check.names = check.n, 
           fix.empty.names = fix.empty.names, stringsAsFactors =    stringsAsFactors)))

Looks like this will expect the non-NA value from each column, which is unnecessary.
Comment 5 Martin Maechler 2015-11-13 10:48:28 UTC
(In reply to Mike Jiang from comment #4)
> I think there is a bug introduced by as.data.frame.list. Here is
> reproducible example:
> 
>   > library(flowCore)
>   > data(GvHD)
>   > frame <- GvHD[[1]]

 .............

> 
> However with the latest changes, we see the following error
>   > dd
>   Error in (function (..., row.names = NULL, check.rows = FALSE, check.names
> =    TRUE,  : 
>     row names contain missing values


Hmm.. a somewhat large bioconductor package (with dependencies!) needs to
be installed [from a server which is slow to the non-US part of the world]
to make this reproducible.
In short, your dd  ends up being identical to the one made with

desc <- structure(c("FSC-Height", "SSC-Height", "CD15 FITC", "CD45 PE",
                    "CD14 PerCP", NA, "CD33 APC", "Time (51.20 sec.)"),
       .Names = c("$P1S", "$P2S", "$P3S", "$P4S", "$P5S", NA, "$P7S", "$P8S"),
                  class = "AsIs")

dd <- data.frame(desc = desc,
                 row.names = c("$P1","$P2","$P3","$P4","$P5","$P6","$P7","$P8"))


where you should note that the *names* of the 'desc' column of the data
frame do have an NA and they *differ* from the row.names of the data frame
(which have no NA).

I'd say that this is not a legal data.frame.
For instance, for the above dd,

  as.data.frame(unclass(dd))

gives the Error

> as.data.frame(unclass(dd))
Error in data.frame(desc = c("FSC-Height", "SSC-Height", "CD15 FITC",  : 
  row names contain missing values

not only in the latest 'R-devel' but also in previous versions of R.

So on valid conclusion could be: The change which fixed the bug and made
format.data.frame() more consistent by building on top of
as.data.frame.list(.) has *revealed* a bug in the   'flowCore's package
data structures / data handling ...

Another conclusion could be:
The "AsIs" columns of a data frame are allowed to be as strange as they want,
without rendering the data frame illegal.
=> format.data.frame() and hence print() should be able to work "fine" with
such a beast [by *not* looking at the names(.) of those columns].

I'm wondering what others think. Also, where you aware the names of your
'desc' column where contradicting the data frame row names ?
If not, the change has revealed a "design bug" somewhere in your code or
data structures, and hence such a change would be desirable, right?

-------
To continue, let's all use this  (almost) minimal reproducible example:


## Minimal reproducible example:

desc <- structure( c("a", NA, "z"), .Names = c("A", NA, "Z"))
## This gives an error:
dd <- data.frame(desc = desc, stringsAsFactors = FALSE)
## however
dd <- data.frame(desc = structure(desc, class="AsIs"),
                 row.names = c("A","M","Z"), stringsAsFactors = FALSE)
## "works", but does not print in R-devel
dd
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
##   row names contain missing values
Comment 6 Martin Maechler 2015-11-13 10:56:40 UTC
(In reply to Martin Maechler from comment #5)

> Another conclusion could be:
> The "AsIs" columns of a data frame are allowed to be as strange as they want,
> without rendering the data frame illegal.
> => format.data.frame() and hence print() should be able to work "fine" with
> such a beast [by *not* looking at the names(.) of those columns].

  ?I   in R  documents a bit the semantic of the "AsIs" class

and notably mentions that it can have it's own names .. which hence are allowed to differ from the data frame row.names.

From that I conclude that the above "alternative conclusion" is the correc one, and you are right that the new code is bogous in this case.

I will fix this ASAP.
Martin
Comment 7 Martin Maechler 2015-11-13 16:17:57 UTC
(In reply to Martin Maechler from comment #6)

> I will fix this ASAP.

Fixed in  svn r  69633 .... The change is only in 
as.data.frame.list()  and actually the new code is much simpler than previous one... so with thanks to Mike Jiang  who has triggered this change.
Comment 8 Martin Maechler 2016-01-07 16:24:03 UTC
*** Bug 16658 has been marked as a duplicate of this bug. ***