Bug 16918 - With near-equal numbers in 'by', aggregate.data.frame(drop=FALSE) gives extra row
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Analyses
Version: R-devel (trunk)
Hardware: All All
Importance: P5 minor
Assignee: R-core
Reported: 2016-05-20 17:56 UTC by Suharto Anggono
Modified: 2017-05-25 20:31 UTC (History)
Attachments
Attempted fix (1.63 KB, patch)
2017-05-20 13:01 UTC, Suharto Anggono
Attempted fix (1.53 KB, patch)
2017-05-20 18:10 UTC, Suharto Anggono

Description Suharto Anggono 2016-05-20 17:56:42 UTC
This is an example.

R> group <- c(sqrt(2)^2, 2)
R> print(aggregate(data.frame(n = seq(group)), list(group = group), length,
R+ drop = FALSE), digits = 17)
               group n
1 2.0000000000000000 2
2 2.0000000000000004 0
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(la
bels,  :
  duplicated levels in factors are deprecated

If sqrt(2)^2 and 2 are considered equal, there is only one group, with two members. So row 2 of the result, with 0 in 'n', should not be there.

Compare with the following that uses default 'aggregate.data.frame' (drop=TRUE).

R> group <- c(sqrt(2)^2, 2)
R> print(aggregate(data.frame(n = seq(group)), list(group = group), length),
R+ digits = 17)
               group n
1 2.0000000000000004 2
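
A minimal sketch of why the two values can be treated as one group or as two, depending on how they are compared. It only uses base R; the suggestion that this mismatch is what drives the extra row is an assumption based on the fix discussed in Comment 4, not something stated in the description.

```r
# Two doubles that differ only in the last bit
group <- c(sqrt(2)^2, 2)

group[1] == group[2]        # exact comparison: the two values differ
length(unique(group))       # unique() works on the exact doubles
as.character(group)         # conversion to character uses 15 significant digits
nlevels(as.factor(group))   # so the factor ends up with a single level
```

Grouping through a factor therefore sees one group, while sort(unique()) on the raw numbers sees two candidate rows.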

R> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 2

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
Comment 1 Suharto Anggono 2016-05-20 23:07:54 UTC
(In reply to Suharto Anggono from comment #0)
> If sqrt(2)^2 and 2 are considered equal, there is only one group, with two
> members. So row 2 of the result, with 0 in 'n', should not be there.

Also compare with the following.

R> group <- c(sqrt(2)^2, 2)
R> print(aggregate(data.frame(n = seq(group)), list(group = as.factor(group)),
R+ length, drop = FALSE), digits = 17)
  group n
1     2 2
Comment 2 Suharto Anggono 2017-03-25 15:06:17 UTC
From NEWS: "..., duplicated factor levels now produce an error in levels<- instead of a warning, ...."
Because of that, the example in the Description gives an error in R 3.4.0 alpha.
Comment 3 Martin Maechler 2017-03-27 13:21:29 UTC
I acknowledge we have a bug... 
but I am *not* volunteering to fix it, as I have not been much of a user of
`aggregate.data.frame()`

=> tested patches are welcome notably if they are small.
Comment 4 Suharto Anggono 2017-05-01 10:52:08 UTC
A quick fix is changing
function(e) sort(unique(e))
to
function(e) e[match(sort(unique(as.factor(e))), as.factor(e))]
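
A small sketch comparing the two expressions on the vector from the Description. The names levels_old and levels_new are hypothetical, introduced here only for illustration; the function bodies are the ones quoted above.

```r
# Hypothetical names; the bodies are the two expressions quoted above
levels_old <- function(e) sort(unique(e))
levels_new <- function(e) e[match(sort(unique(as.factor(e))), as.factor(e))]

group <- c(sqrt(2)^2, 2)

length(levels_old(group))   # counts exact doubles, so near-equal values stay apart
length(levels_new(group))   # counts factor levels, so near-equal values collapse

# The replacement returns original elements (first occurrence per level),
# so the type of the grouping variable is preserved
levels_new(c("b", "a", "b"))
```

Because the replacement indexes back into e, a numeric grouping variable stays numeric and a character one stays character.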
Comment 5 Michael Lawrence 2017-05-01 11:30:47 UTC
The question is how much precision to consider. It is odd that aggregate.data.frame() returns grouping columns that are non-factors, and it is strange to group based on real-valued data. as.factor() calls as.character(), which uses 15 significant digits (C99's DBL_DIG on machines with IEC 60559 arithmetic). Should it just round numeric values to DBL_DIG significant digits?
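
As a rough illustration of the rounding idea, and not the change that was eventually committed, numeric keys could be rounded to 15 significant digits before grouping. The helper round_key is hypothetical, and the value 15 is written by hand since base R does not expose DBL_DIG as a named constant.

```r
# Hypothetical helper: round grouping keys to 15 significant digits
round_key <- function(x, digits = 15L) signif(x, digits)

group <- c(sqrt(2)^2, 2)
keys  <- round_key(group)

length(unique(group))   # exact doubles: the two values are distinct
length(unique(keys))    # rounded keys: the two values coincide
```

Unlike the as.factor() route, this changes the values reported in the grouping column, which is one reason it may be less attractive.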
Comment 6 Suharto Anggono 2017-05-20 13:01:14 UTC
Created attachment 2254
Attempted fix

This follows the same approach as Comment 4.
With this, the result no longer has the attribute "out.attrs", and a character grouping variable no longer becomes a factor in the result (example 1 in https://stat.ethz.ch/pipermail/r-help/2016-May/438631.html).
Comment 7 Suharto Anggono 2017-05-20 18:10:49 UTC
Created attachment 2255
Attempted fix

This follows the same approach as Comment 4.
The result still has the attribute "out.attrs", and a character grouping variable still becomes a factor in the result. If desired, the call to 'expand.grid' can use KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE.
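
For reference, a short sketch of what those two 'expand.grid' arguments change. The grouping values are made up for illustration and are not taken from the patch.

```r
# Defaults: character input becomes a factor and "out.attrs" is attached
default_grid <- expand.grid(g1 = c("a", "b"), g2 = 1:2)
class(default_grid$g1)
names(attributes(default_grid))

# With both arguments set, strings stay character and "out.attrs" is omitted
plain_grid <- expand.grid(g1 = c("a", "b"), g2 = 1:2,
                          KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
class(plain_grid$g1)
names(attributes(plain_grid))
```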
Comment 8 Martin Maechler 2017-05-23 14:19:53 UTC
(In reply to Suharto Anggono from comment #7)
> Created attachment 2255
> Attempted fix
> 
> This has the same theme with Comment 4.
> The result still has attribute "out.attrs" and character grouping variable
> still becomes a factor in the result. If desired, call to 'expand.grid' can
> use KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE.

I'm looking at the patch
_plus_ using   "KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE"  in 'expand.grid'
@Michael Lawrence ... you are right that factors would make more sense, but note that the default behavior ('drop = TRUE') has always returned strings instead of factors in this case, and that has been part of aggregate.data.frame()'s behavior for a very long time, whereas the new option 'drop = FALSE' has been added only for R 3.3.0 (but already in r68963 | Sun, 09 Aug 2015).

So, for stability reasons, it seems to make sense to _only_ change the drop=FALSE case and make that as similar as sensible to the old/default 'drop=TRUE'.

(Examples 2 and 3 in Suharto's R-help post https://stat.ethz.ch/pipermail/r-help/2016-May/438631.html are not addressed yet, and maybe should be, but _not_ in this PR.)
After a 'make check-all' I plan to commit the patch (as mentioned above).
Comment 9 Suharto Anggono 2017-05-24 16:19:09 UTC
It turns out that the current presentation of grouping variables in the 'aggregate.data.frame' result was introduced in R 2.6.0.
NEWS item:
aggregate.data.frame() no longer changes the group variables into factors, and leaves alone the levels of those which are factors.  (Inter alia grants the wish of PR#9666.)
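
A brief sketch of the behavior that NEWS item describes, using the default drop = TRUE. The data are made up for illustration.

```r
d <- data.frame(n = 1:3)

# A character grouping variable stays character in the result
res_chr <- aggregate(d, list(g = c("a", "b", "a")), length)
class(res_chr$g)

# A factor grouping variable keeps its levels, including unused ones
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
res_fac <- aggregate(d, list(g = f), length)
levels(res_fac$g)
```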
Comment 10 Martin Maechler 2017-05-25 20:31:19 UTC
Thank you for the background info. So this (character, not factor) behavior has been very much on purpose.

As I _have_ committed the fix (to R-devel for now), we can close the report now.
As mentioned, another report may be appropriate for the behavior you observed in examples 2 and 3 of the R-help post from May 2016.