Bugzilla – Bug 15004
aggregate.data.frame fails due to comparison of double values
Last modified: 2014-07-22 23:06:09 UTC
Created attachment 1340 [details]
test program to reproduce the problem
Created attachment 1341 [details]
patch for the version R-devel 2012.7.25
Created attachment 1342 [details]
original file used to make patch
copy of src/library/stats/R/aggregate.R in R-devel_2012-07-25.tar.gz
Dear R-devel contributors,
I found a bug in aggregate.data.frame.
It internally uses a double vector to identify the combinations of factors, and then compared via match(). But sometimes it fails.
Suppose that there are two combinations of factors represented by double values x and y. If
x = 1e+100
y = 1e+100 + 1
match(x, y) returns 1 and they are treated as the same.
I attached the patch for this problem.
It first transforms the factors into factors and its integer value, and then converts to characters and combines then with paste(sep = ".").
This patch is written to apply the version R-devel 2012.7.25.
The original source code is src/library/stats/R/aggregate.R.
I don't think the patch will get the sort order right.
Created attachment 1343 [details]
patch to have the sort order right
(In reply to comment #4)
> I don't think the patch will get the sort order right.
Thank you for your comment.
I fixed the patch to have the right order.
In this patch, grp is represented like "001.012.3124.0005".
Created attachment 1344 [details]
patch (should be faster)
(In reply to comment #6)
> (In reply to comment #4)
> > I don't think the patch will get the sort order right.
> Thank you for your comment.
> I fixed the patch to have the right order.
> In this patch, grp is represented like "001.012.3124.0005".
I noticed that this representation is bad since (the subset of) grp is reused for sorting. I fixed to use only the order of the above representation.
There should be better ways to map each combination of factors into a integer.
I just post this patch to make it work right externally and hope that someone makes it better.
Created attachment 1345 [details]
better way to generate integer identifier
Finally I have found a better way.
In this patch grp is defined as the sort order of factors.
I am sorry the last one does not work with ties. I will fix it.
Created attachment 1346 [details]
patch using numbers-and-dots, corrected for ties
Created attachment 1347 [details]
patch with minimal changes
Thank you so much for your attention. I uploaded two patches, namely:
The details are as follows:
aggregate.patch.4 is the corrected version of numbers-and-dots representation.
It combines each combination of factors like "001.0234.56789" and takes its rank() to have an integer representation.
aggregate.patch.5 has the same strategy as the original one, but uses rank() to avoid explosion of grp values.
In addition I changed grp from double into integer to avoid comparison failure, and also changed as.factor() to factor(), to reduce nlevels(ind) if possible.
Both are not perfect, just quick fixes, but works fine even with the test code that the original version fails. I hope some fix are done to this problem, by any way.
Created attachment 1348 [details]
patch using numbers-and-dots, corrected again
added scientific = FALSE option to format()
aggregate.patch.4 has been revised. Added scientific = FALSE option to format().
I also tried a more direct implementation:
grp <- integer(nrx)
dic <- split(1:nrx, by, drop = TRUE)
for(i in 1:length(dic))
grp[dic[[i]]] <- i - 1
But it was too slow. It took over 40 seconds with a dataset having over 4,000 combinations of factors, while it took less than three seconds via the former two patches (about 10% faster than the original one in this case). Numbers-and-dots implementation (patch 4) was slightly faster than minimal one (patch 5).
Just for information the original function returned only about 1,600 combinations of factors due to the comparison problem. There was no error message or warning so it was hard to detect what was wrong.
Thank you for your attention and I am sorry for too many posts.
Why is Ryota's patch still not in the codebase? I stumbled on this bug and spend some time, try to figure why I'm losing rows in a data frame.
(In reply to comment #17)
> Why is Ryota's patch still not in the codebase? I stumbled on this bug and
> spend some time, try to figure why I'm losing rows in a data frame.
Which one? I counted about 10 different patches in his messages.
(In reply to comment #18)
> (In reply to comment #17)
> > Hey,
> > Why is Ryota's patch still not in the codebase? I stumbled on this bug and
> > spend some time, try to figure why I'm losing rows in a data frame.
> Which one? I counted about 10 different patches in his messages.
aggregate.patch.4 worked best when I wrote them. I will check this patch with current codebase.
# Thank you Alexey for finding this patch.
Created attachment 1492 [details]
patch for aggregate.R revision 63161
This is a patch for the current revision. I fixed the sort order to be identical to the existing implementation (rev() was needed to be identical to the current implementation).
I tested this code that:
* It returns the same result as the current version with small data
(Using identical() with MASS::Boston data set)
* It passes the test code I've already posted
(test.aggregate.data.frame.R, id = 1340)
I'll be committing code to fix this in R-devel and R-patched soon. It is based on Ryota's final patch, but with several changes:
- That code did not handle NA properly.
- I don't like calling .Internal from another package, even from one base package to another.
- "fun" is not a very informative name.
Thanks to Ryota for your persistence in working on this. If you do it again, please test against the output from running example() on the function; that's how I discovered the NA bug. (It's also a good idea to run the regression tests in the R source on it.)
Thank you Duncan, and Alexey too! I will pay more attention to testing next time.
*** Bug 15116 has been marked as a duplicate of this bug. ***