Bug 16658 - rbind.data.frame() should get an `stringsAsFactors` argument
Status: ASSIGNED
Product: R
Classification: Unclassified
Component: Language
Version: R-devel (trunk)
Hardware: All All
Importance: P5 enhancement
Assignee: R-core
Reported: 2016-01-07 13:06 UTC by Peter Meissner
Modified: 2016-01-12 18:39 UTC

Description Peter Meissner 2016-01-07 13:06:50 UTC
rbind() will show somewhat unexpected behavior when used with empty data frames. 


While something like this works as expected - i.e. producing character columns:



lapply( 
  rbind(data.frame("b", stringsAsFactors=FALSE), "a"), 
  class 
)

## $X.b.
## [1] "character"



The request to use stringsAsFactors=FALSE is ignored in the case of an empty data frame: 



lapply( 
  rbind(data.frame(     stringsAsFactors=FALSE), "a"), 
  class 
)

## $X.a.
## [1] "factor"



On the other hand using options() to change the default will result in character columns:

options("stringsAsFactors"=FALSE)

lapply( 
  rbind(data.frame(     ), "a"), 
  class 
)

## $X.a.
## [1] "character"


lapply( 
  rbind(data.frame(     stringsAsFactors=FALSE), "a"), 
  class 
)

## $X.a.
## [1] "character"





INFO:

sessionInfo()

## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows >= 8 x64 (build 9200)
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
## [1] tools_3.2.3
Comment 1 Martin Maechler 2016-01-07 16:24:03 UTC
This was reported a while ago and has also been fixed, but for R-devel only (to become R 3.3.0, sometime in April 2016), because the fix involved quite a few "cleanup" / "re-organization" changes in that part of the data frame utility code.

*** This bug has been marked as a duplicate of bug 16580 ***
Comment 2 Martin Maechler 2016-01-07 16:37:22 UTC
I'm sorry: This is *not* a duplicate of PR#16580  and the symptoms you show are still present in R-devel.
Comment 3 Martin Maechler 2016-01-07 16:55:08 UTC
(In reply to Peter Meissner from comment #0)

Your reproducible code makes the problem a bit harder to pinpoint.
It's easy to see that the 'stringsAsFactors' argument of data.frame() cannot have any effect in the empty case:

Look at this:

identical3 <- function(x,y,z) identical(x,y) && identical(y,z)
identical3(data.frame(),
           data.frame(stringsAsFactors=TRUE),
           data.frame(stringsAsFactors=FALSE)) #  TRUE

op <- options(stringsAsFactors=FALSE)
identical3(data.frame(),
           data.frame(stringsAsFactors=TRUE),
           data.frame(stringsAsFactors=FALSE)) # TRUE
options(op) # revert

---
So stringsAsFactors has no effect on empty data frames, and "of course"!
An empty data frame has neither "character" nor "factor" entries.

What you observe is the behavior of rbind.data.frame(), and indeed that *is* influenced -- via as.data.frame() -- by the global option 'stringsAsFactors'.

Honestly, I believe that the introduction of  options(stringsAsFactors = *)
has been by far the biggest mistake in the R-core based history of R,
and this bug is just another consequence of that mistaken decision.
Instead of the option, an explicit *argument*  stringsAsFactors should be part of the argument list of all functions that require it and be passed down to further functions etc.
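
A minimal sketch of what "an explicit argument passed down" could look like, as opposed to reading the global option. The helper name frame_from_list is hypothetical and is not part of base R; it only relies on as.data.frame() accepting stringsAsFactors for list input.

```r
# Hypothetical helper (not part of base R): the caller states the policy
# explicitly and it is passed down, instead of being read from options().
frame_from_list <- function(cols, stringsAsFactors = FALSE) {
  as.data.frame(cols, stringsAsFactors = stringsAsFactors)
}

chr <- frame_from_list(list(x = c("a", "b")))
fac <- frame_from_list(list(x = c("a", "b")), stringsAsFactors = TRUE)

class(chr$x)  # column kept as character
class(fac$x)  # column converted to factor
```

With this shape, the result depends only on the call itself and not on whatever options(stringsAsFactors = *) happens to be set in the session.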

The fix for this bug will also have to do this.. and be somewhat unrewarding to do...
Comment 4 Martin Maechler 2016-01-07 17:00:49 UTC
(In reply to Martin Maechler from comment #3)

> The fix for this bug will also have to do this.. and be somewhat unrewarding
> to do...

Actually, this is *not* a bug, but rather a request for rbind.data.frame() to get
an argument `stringsAsFactors`,  hence  bug category 'enhancement'
Comment 5 Peter Dalgaard 2016-01-07 17:45:30 UTC
I think there actually is a real bug in here:

> dput(rbind(data.frame( X=character(0),    stringsAsFactors=FALSE),"b"))
structure(list(X.b. = structure(1L, .Label = "b", class = "factor")), .Names = "X.b.", row.names = 1L, class = "data.frame")
> dput(rbind(data.frame( X=character(1),    stringsAsFactors=FALSE),"b"))
structure(list(X = c("", "b")), .Names = "X", row.names = 1:2, class = "data.frame")

(with current R-devel)

Notice that rbind.data.frame in both cases could infer that the column is character and not factor, but it doesn't do so in the 0-row case.
Comment 6 Martin Maechler 2016-01-08 21:11:03 UTC
(In reply to Peter Dalgaard from comment #5)
> I think there actually is a real bug in here:
> 
> > dput(rbind(data.frame( X=character(0),    stringsAsFactors=FALSE),"b"))
> structure(list(X.b. = structure(1L, .Label = "b", class = "factor")), .Names
> = "X.b.", row.names = 1L, class = "data.frame")
> > dput(rbind(data.frame( X=character(1),    stringsAsFactors=FALSE),"b"))
> structure(list(X = c("", "b")), .Names = "X", row.names = 1:2, class =
> "data.frame")
> 
> (with current R-devel)
> 
> Notice that rbind.data.frame in both cases could infer that the column is
> character and not factor, but it doesn't do so in the 0-row case.

You are right, Peter, that *is* an infelicitous inconsistency in
rbind.data.frame()'s behavior.
I'll look into fixing it.
Comment 7 Martin Maechler 2016-01-11 15:17:56 UTC
(In reply to Peter Dalgaard from comment #5)
> I think there actually is a real bug in here:
  [........]

Now that I have added 'stringsAsFactors' as an explicit argument to 'rbind.data.frame',
we can look at this a bit more succinctly (and for all cases).
The following is a pure R script -- for almost latest R-devel only! -- with results as comments:


## now where  'stringsAsFactors' is also an explicit argument to rbind() :
str(d0 <- data.frame( X=character(0), stringsAsFactors=FALSE))# $ X: chr
str(d1 <- data.frame( X=character(1), stringsAsFactors=FALSE))# $ X: chr ""

## Next line is PD's bug -- at least there is an inconsistency between line 1 & 3 :
str( rbind(d0, "b", stringsAsFactors= TRUE)[,1] ) # Factor w/ 1 level "b": 1
str( rbind(d0, "b", stringsAsFactors=FALSE)[,1] ) # chr "b"
str( rbind(d1, "b", stringsAsFactors= TRUE)[,1] ) # chr [1:2] "" "b"
str( rbind(d1, "b", stringsAsFactors=FALSE)[,1] ) # chr [1:2] "" "b"

## Compare with
str(d0f <- data.frame( X=character(0), stringsAsFactors=TRUE))# $ X: Factor w/ 0 levels
str(d1f <- data.frame( X=character(1), stringsAsFactors=TRUE))# $ X: Factor w/ 1 le....

## where one could argue that this is also not consistent:
str( rbind(d0f, "b", stringsAsFactors= TRUE)[,1]) # Factor w/ 1 level "b": 1
str( rbind(d0f, "b", stringsAsFactors=FALSE)[,1]) # chr "b"
str( rbind(d1f, "b", stringsAsFactors= TRUE)[,1]) # Factor ""+ Warn. *invalid* .. level -> NA
str( rbind(d1f, "b", stringsAsFactors=FALSE)[,1]) # Factor ""+ Warn. *invalid* .. level -> NA

##--- Further variations: ----------
##    ------------------ Here, 'stringsAsFactors' argument has *NO* influence ever!
## Var I: first 0 rows, then 1
identical(rbind(d0 , d1 , stringsAsFactors=FALSE) -> D1,
          rbind(d0 , d1 , stringsAsFactors= TRUE)); str(D1$X)# chr ""
identical(rbind(d0 , d1f, stringsAsFactors=FALSE) -> D2,
          rbind(d0 , d1f, stringsAsFactors= TRUE)); str(D2$X)# Factor w/ 1 level "": 1
identical(rbind(d0f, d1f, stringsAsFactors=FALSE) -> D3,
          rbind(d0f, d1f, stringsAsFactors= TRUE)); str(D3$X)# Factor w/ 1 level "": 1
identical(rbind(d0f, d1 , stringsAsFactors=FALSE) -> D4,
          rbind(d0f, d1 , stringsAsFactors= TRUE)); str(D4$X)# chr ""
stopifnot(identical(D1, D4),
          identical(D2, D3))

## Var II: first 1 row, then 0 :
identical(rbind(d1 , d0 , stringsAsFactors=FALSE) -> D1,
          rbind(d1 , d0 , stringsAsFactors= TRUE)); str(D1$X)# chr ""
identical(rbind(d1 , d0f, stringsAsFactors=FALSE) -> D2,
          rbind(d1 , d0f, stringsAsFactors= TRUE)); str(D2$X)# chr ""
identical(rbind(d1f, d0f, stringsAsFactors=FALSE) -> D3,
          rbind(d1f, d0f, stringsAsFactors= TRUE)); str(D3$X)# Factor w/ 1 level "": 1
identical(rbind(d1f, d0 , stringsAsFactors=FALSE) -> D4,
          rbind(d1f, d0 , stringsAsFactors= TRUE)); str(D4$X)# Factor w/ 1 level "": 1
stopifnot(identical(D1, D2),
          identical(D3, D4))

## Var III: 1 row, twice
identical(rbind(d1 , d1 , stringsAsFactors=FALSE) -> D1,
          rbind(d1 , d1 , stringsAsFactors= TRUE)); str(D1$X)# chr [1:2] "" ""
identical(rbind(d1 , d1f, stringsAsFactors=FALSE) -> D2,
          rbind(d1 , d1f, stringsAsFactors= TRUE)); str(D2$X)# chr [1:2] "" ""
identical(rbind(d1f, d1f, stringsAsFactors=FALSE) -> D3,
          rbind(d1f, d1f, stringsAsFactors= TRUE)); str(D3$X)# Factor w/ 1 level "": 1 1
identical(rbind(d1f, d1 , stringsAsFactors=FALSE) -> D4,
          rbind(d1f, d1 , stringsAsFactors= TRUE)); str(D4$X)# Factor w/ 1 level "": 1 1
stopifnot(identical(D1, D2),
          identical(D3, D4))

## Var IV: 0 rows, twice
identical(rbind(d0 , d0 , stringsAsFactors=FALSE) -> D1,
          rbind(d0 , d0 , stringsAsFactors= TRUE)); str(D1$X)# chr(0)
identical(rbind(d0 , d0f, stringsAsFactors=FALSE) -> D2,
          rbind(d0 , d0f, stringsAsFactors= TRUE)); str(D2$X)# chr(0)
identical(rbind(d0f, d0f, stringsAsFactors=FALSE) -> D3,
          rbind(d0f, d0f, stringsAsFactors= TRUE)); str(D3$X)# Factor w/ 0 levels:
identical(rbind(d0f, d0 , stringsAsFactors=FALSE) -> D4,
          rbind(d0f, d0 , stringsAsFactors= TRUE)); str(D4$X)# Factor w/ 0 levels:
stopifnot(identical(D1, D2), identical(d0, D1),
          identical(D3, D4), identical(d0f, D3))

-------------------------

In spite of the noted inconsistencies, I am not 100% sure whether, and what exactly, we would want to change in the above. The question ends up being what stringsAsFactors should mean exactly in rbind.data.frame, or put differently: how should data frames be rbound when they have matching column names but, for (say) one column, some data frames have a factor and some have a character variable? The "variations" above show that 'stringsAsFactors' currently has *no* influence on the rbind() behavior when we really have 2 data frames.
Comment 8 Martin Maechler 2016-01-11 15:50:41 UTC
(In reply to Martin Maechler from comment #4)
> (In reply to Martin Maechler from comment #3)
> 
> > The fix for this bug will also have to do this.. and be somewhat unrewarding
> > to do...
> 
> Actually, this is *not* a bug, but rather a request for rbind.data.frame()
> to get
> an argument `stringsAsFactors`,  hence  bug category 'enhancement'

Committed to R-devel, last (late) evening:

r69906 | maechler | 2016-01-11 11:37:45 +0100 (Mon, 11 Jan 2016) 

so the original bug has been fixed / enhancement has been implemented.
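
For illustration, a sketch of how the original report's empty-data-frame case can be written with the new argument. It assumes an R version in which rbind.data.frame() has the 'stringsAsFactors' argument (R-devel as of the commit above, i.e. the future R 3.3.0); the variable names are only for this example.

```r
# Assumes rbind.data.frame() accepts 'stringsAsFactors' (R >= 3.3.0).
empty <- data.frame()

as_chr <- rbind(empty, "a", stringsAsFactors = FALSE)
as_fac <- rbind(empty, "a", stringsAsFactors = TRUE)

vapply(as_chr, class, "")  # the single column is character
vapply(as_fac, class, "")  # the single column is a factor
```

The request is now made on the rbind() call itself, so it no longer matters that stringsAsFactors cannot be "stored" in an empty data frame.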
Comment 9 Peter Dalgaard 2016-01-11 16:16:33 UTC
The crux of the zero-length data.frame issue is in this part of rbind.data.frame:

    allargs <- list(...)
    allargs <- allargs[lengths(allargs) > 0L]

    if (length(allargs)) {
        nr <- vapply(allargs, function(x) if (is.data.frame(x)) 
            .row_names_info(x, 2L)
        else if (is.list(x)) 
            length(x[[1L]])
        else length(x), 1L)
        if (any(nr > 0L)) 
            allargs <- allargs[nr > 0L]
        else return(allargs[[1L]])
    }
 

This removes any 0-row data frame before its types/classes get a chance to influence the result. This _is_ as documented (but might not be desirable):

     The ‘rbind’ data frame method first drops all zero-column and
     zero-row arguments.  (If that leaves none, [...]
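
The filtering step quoted above can be tried in isolation. The following is only a sketch: surviving_args is a hypothetical wrapper around the quoted lines (it returns the list of arguments that remain instead of continuing with the rbind), not a function in base R.

```r
# Hypothetical wrapper around the filtering logic quoted above:
# returns the arguments that would still take part in the rbind.
surviving_args <- function(...) {
  allargs <- list(...)
  allargs <- allargs[lengths(allargs) > 0L]  # drop zero-length (e.g. zero-column) arguments
  if (length(allargs)) {
    nr <- vapply(allargs, function(x) {
      if (is.data.frame(x)) .row_names_info(x, 2L)  # number of rows
      else if (is.list(x)) length(x[[1L]])
      else length(x)
    }, 1L)
    if (any(nr > 0L)) {
      allargs <- allargs[nr > 0L]  # zero-row arguments are dropped here
    } else {
      allargs <- allargs[1L]       # nothing has rows: only the first argument is kept
    }
  }
  allargs
}

d0 <- data.frame(X = character(0), stringsAsFactors = FALSE)
kept <- surviving_args(d0, "b")
length(kept)  # d0 is gone, so its character column cannot influence the result
```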
Comment 11 Martin Maechler 2016-01-12 16:57:59 UTC
(In reply to Martin Maechler from comment #7)

I am now (again) claiming that there is no remaining bug (about this), apart from missing documentation which should explain (a version of) the following:

'stringsAsFactors' has no effect in 'rbind.data.frame()' except in the situation
where all but one of the objects to be rbound have 0 rows, and the remaining one is a "string". Then, the string is treated according to 'stringsAsFactors' to become either a factor or a character ("string") column in the resulting data frame.

In other cases, when data frames are rbound with a column that is 'character' in one data frame and a 'factor' in another, in the case of two data frames, the following rules are used:
1) If one of the two has zero rows and the other not, the other determines the result class.
2) otherwise, the first data frame determines the result class.
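
The two rules can be checked with the data frames from comment 7. This is a sketch that restates results already reported there (Var I and Var III); other R versions may behave differently.

```r
d0  <- data.frame(X = character(0), stringsAsFactors = FALSE)
d0f <- data.frame(X = character(0), stringsAsFactors = TRUE)
d1  <- data.frame(X = character(1), stringsAsFactors = FALSE)
d1f <- data.frame(X = character(1), stringsAsFactors = TRUE)

# Rule 1: the data frame that has rows determines the class.
class(rbind(d0,  d1f)$X)  # reported in comment 7 as a factor
class(rbind(d0f, d1 )$X)  # reported in comment 7 as character

# Rule 2: both have rows, so the first data frame determines the class.
class(rbind(d1,  d1f)$X)  # reported in comment 7 as character
class(rbind(d1f, d1 )$X)  # reported in comment 7 as a factor
```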

---

Ok, I wrote this yesterday, before reading Peter's new comment which clearly mentions most of the above behavior at least indirectly.
We'll have to decide if the code should be changed; at the moment I tend *not* to.
Comment 12 Peter Meissner 2016-01-12 18:39:54 UTC
You are the experts, but following the discussion, taking the first non-zero-row data frame as reference sounds reasonable to me as well.