Bug 14084 - Possible bug in "unsplit"
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Misc
Version: old
Hardware: All All
Importance: P5 normal
Assigned To: Jitterbug compatibility account
Reported: 2009-11-25 20:27 UTC by Jitterbug compatibility account
Modified: 2009-11-25 20:27 UTC (History)
Description Jitterbug compatibility account 2009-11-25 20:27:54 UTC
From: Ivar Herfindal <ivar.herfindal@bio.ntnu.no>
Dear R-bug-people

I have encountered a problem with "unsplit", which I believe may be 
caused by a bug in the function. However, being inexperienced with bug 
reports, I apologise if this is merely a user problem rather than a 
problem within R.

The problem occurs if an object is split by several grouping factors 
with levels not occurring in the data, using drop = TRUE. This may 
appear to be a special and hardly relevant case, but I had to split a 
data frame on several factors, do some analyses on each of the subsets 
in the split object, and then unsplit it. I had to use drop = TRUE; 
otherwise my analyses would not run. I found a fix for unsplit: the 
problem appears to be that the drop argument is not passed on in the 
recursive call to unsplit within unsplit. Description and example 
below. The problem was found in R versions 2.9.0 and 2.10.0 on 
Windows XP.

 > sessionInfo()
R version 2.10.0 (2009-10-26)
i386-pc-mingw32

locale:
[1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252  LC_CTYPE=Norwegian (Bokmål)_Norway.1252
[3] LC_MONETARY=Norwegian (Bokmål)_Norway.1252 LC_NUMERIC=C
[5] LC_TIME=Norwegian (Bokmål)_Norway.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] tools_2.10.0
 >
## a reproducible example:
dff <- data.frame(gr1 = factor(c(1,1,1,1,1,2,2,2,2,2,2), levels = c(1,2,3,4)),
                  gr2 = factor(c(1,2,1,2,1,2,1,2,1,2,3), levels = c(1,2,3,4)),
                  yy = rnorm(11))
# note that the two grouping factors "gr1" and "gr2" have defined levels
# which do not occur in the data.

dff2 <- split(dff, list(dff$gr1, dff$gr2), drop=TRUE)
# I don't want empty objects, so I use drop=TRUE

# now I want to unsplit it, and expect the following to work:
dff3 <- unsplit(dff2, list(dff$gr1, dff$gr2), drop=TRUE)
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1", "11", "3", 
"11", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1’, ‘11’, ‘3’, ‘5’

### end

Looking at the unsplit function, we find:
 > unsplit
function (value, f, drop = FALSE)
{
    len <- length(if (is.list(f)) f[[1L]] else f)
    if (is.data.frame(value[[1L]])) {
        x <- value[[1L]][rep(NA, len), , drop = FALSE]
        rownames(x) <- unsplit(lapply(value, rownames), f)
    }
    else x <- value[[1L]][rep(NA, len)]
    split(x, f, drop = drop) <- value
    x
}
<environment: namespace:base>
 >

Note that if the first element of "value" is a data.frame, then the row 
names for the output x are created by the call:
rownames(x) <- unsplit(lapply(value, rownames), f)

This call to unsplit ignores the drop-argument, and in the example above 
we get from this call:
 > unsplit(lapply(dff2, rownames), list(dff$gr1, dff$gr2))
[1] "1" "11" "3" "11" "5" "1" "7" "3" "9" "5" "11"

i.e. not unique row names for the output x.
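
The mismatch can be made visible by counting groups. The following is only a sketch, assuming the same factor layout as in the example above; the names g1, g2, idx, f_keep and f_drop are introduced here for illustration and are not part of the report. Without drop = TRUE the interaction of two four-level factors has 16 groups, while the list produced by split(..., drop = TRUE) has only 5 elements, so the inner call cannot pair groups with list elements one to one.

```r
# Same grouping factors as in the example above (illustrative names).
g1 <- factor(c(1,1,1,1,1,2,2,2,2,2,2), levels = 1:4)
g2 <- factor(c(1,2,1,2,1,2,1,2,1,2,3), levels = 1:4)
idx <- seq_along(g1)

# drop = FALSE keeps every combination of levels, including empty ones.
f_keep <- split(idx, list(g1, g2), drop = FALSE)
# drop = TRUE keeps only the combinations that occur in the data.
f_drop <- split(idx, list(g1, g2), drop = TRUE)

length(f_keep)  # 16
length(f_drop)  # 5
names(f_drop)   # "1.1" "2.1" "1.2" "2.2" "2.3"
```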

A simple fix is to add drop = drop to that call, such that the 
updated unsplit (here called unsplit2) is like this:

unsplit2 <- function (value, f, drop = FALSE)
{
    len <- length(if (is.list(f)) f[[1L]] else f)
    if (is.data.frame(value[[1L]])) {
        x <- value[[1L]][rep(NA, len), , drop = FALSE]
        # note the new "drop = drop" in the inner call
        rownames(x) <- unsplit(lapply(value, rownames), f, drop = drop)
    }
    else x <- value[[1L]][rep(NA, len)]
    split(x, f, drop = drop) <- value
    x
}

This works fine in the example above, and the original levels in gr1 and 
gr2 (i.e. they both have four levels) are maintained in the output data 
frame, so that it has the same attributes as the original dff:

 > dff3 <- unsplit2(dff2, list(dff$gr1, dff$gr2), drop=TRUE)
 > dff3
   gr1 gr2          yy
1    1   1  2.13749771
2    1   2 -0.02166458
3    1   1  0.45960452
4    1   2  2.72074958
5    1   1 -0.17536995
6    2   2 -0.08909495
7    2   1  0.94260802
8    2   2 -0.09979505
9    2   1  1.22240834
10   2   2 -0.81710781
11   2   3  0.76071130
 >

I must admit that I have not been able to check whether such a quick fix 
conflicts with other uses of unsplit or with other types of data, but I 
cannot see why it should be a problem.
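
One way to probe for such conflicts is a small round-trip check. This is a sketch only, assuming an R version in which unsplit passes drop on to its inner call; on an older version, unsplit2 from above could be substituted for unsplit. The seed and the names f, back and v are illustrative.

```r
set.seed(1)  # arbitrary seed so the check is repeatable
dff <- data.frame(gr1 = factor(c(1,1,1,1,1,2,2,2,2,2,2), levels = 1:4),
                  gr2 = factor(c(1,2,1,2,1,2,1,2,1,2,3), levels = 1:4),
                  yy = rnorm(11))
f <- list(dff$gr1, dff$gr2)

# Data frame, several factors with unused levels, drop = TRUE.
back <- unsplit(split(dff, f, drop = TRUE), f, drop = TRUE)
stopifnot(identical(rownames(back), rownames(dff)),
          identical(back$yy, dff$yy),
          identical(as.character(back$gr1), as.character(dff$gr1)),
          identical(as.character(back$gr2), as.character(dff$gr2)),
          identical(levels(back$gr1), levels(dff$gr1)),
          identical(levels(back$gr2), levels(dff$gr2)))

# Plain vector, single factor, default drop = FALSE.
v <- unsplit(split(dff$yy, dff$gr1), dff$gr1)
stopifnot(identical(v, dff$yy))
```

If every stopifnot passes silently, the data frame and the vector both survive the split/unsplit round trip. This covers only the two cases shown, not every type of input.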

Sincerely

Ivar Herfindal
--------------------------------
Centre for Conservation Biology
Norwegian University for Science and Technology
N-7491 Trondheim, Norway

email: ivar.herfindal@bio.ntnu.no

Comment 1 Jitterbug compatibility account 2009-11-27 16:02:00 UTC
NOTES:
 Fixed in 2.10.0 patched
Comment 2 Jitterbug compatibility account 2009-11-27 16:02:54 UTC
Audit (from Jitterbug):
Fri Nov 27 10:02:53 2009	ripley	changed notes
Fri Nov 27 10:02:54 2009	ripley	moved from incoming to Misc-fixed