Bug 15004 - aggregate.data.frame fails due to comparison of double values
aggregate.data.frame fails due to comparison of double values
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Accuracy
R-devel (trunk)
All All
: P5 critical
Assigned To: R-core
: 15116 (view as bug list)
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-07-27 14:03 UTC by Ryota Suzuki
Modified: 2014-07-22 23:06 UTC (History)
5 users (show)

See Also:


Attachments
test program to reproduce the problem (333 bytes, application/octet-stream)
2012-07-27 14:03 UTC, Ryota Suzuki
Details
patch for the version R-devel 2012.7.25 (704 bytes, patch)
2012-07-27 14:04 UTC, Ryota Suzuki
Details | Diff
original file used to make patch (7.72 KB, application/octet-stream)
2012-07-27 14:06 UTC, Ryota Suzuki
Details
patch to have the sort order right (833 bytes, patch)
2012-07-27 14:43 UTC, Ryota Suzuki
Details | Diff
patch (should be faster) (840 bytes, patch)
2012-07-27 14:52 UTC, Ryota Suzuki
Details | Diff
better way to generate integer identifier (689 bytes, patch)
2012-07-27 15:25 UTC, Ryota Suzuki
Details | Diff
patch using numbers-and-dots, corrected for ties (832 bytes, patch)
2012-07-27 16:48 UTC, Ryota Suzuki
Details | Diff
patch with minimal changes (186 bytes, patch)
2012-07-27 16:49 UTC, Ryota Suzuki
Details | Diff
patch using numbers-and-dots, corrected again (852 bytes, patch)
2012-07-27 18:07 UTC, Ryota Suzuki
Details | Diff
patch for aggregate.R revision 63161 (1.28 KB, patch)
2013-09-30 07:38 UTC, Ryota Suzuki
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ryota Suzuki 2012-07-27 14:03:00 UTC
Created attachment 1340 [details]
test program to reproduce the problem
Comment 1 Ryota Suzuki 2012-07-27 14:04:03 UTC
Created attachment 1341 [details]
patch for the version R-devel 2012.7.25
Comment 2 Ryota Suzuki 2012-07-27 14:06:40 UTC
Created attachment 1342 [details]
original file used to make patch

copy of src/library/stats/R/aggregate.R in R-devel_2012-07-25.tar.gz
Comment 3 Ryota Suzuki 2012-07-27 14:18:59 UTC
Dear R-devel contributors,

I found a bug in aggregate.data.frame.

It internally uses a double vector to identify the combinations of factors, and then compared via match(). But sometimes it fails.

Suppose that there are two combinations of factors represented by double values x and y. If

x = 1e+100
y = 1e+100 + 1

match(x, y) returns 1 and they are treated as the same.

I attached the patch for this problem.
It first transforms the factors into factors and its integer value, and then converts to characters and combines then with paste(sep = ".").

This patch is written to apply the version R-devel 2012.7.25.
The original source code is src/library/stats/R/aggregate.R.

Best Regards,
Ryota Suzuki
Comment 4 Peter Dalgaard 2012-07-27 14:24:12 UTC
I don't think the patch will get the sort order right.
Comment 5 Ryota Suzuki 2012-07-27 14:43:14 UTC
Created attachment 1343 [details]
patch to have the sort order right
Comment 6 Ryota Suzuki 2012-07-27 14:44:44 UTC
(In reply to comment #4)
> I don't think the patch will get the sort order right.

Thank you for your comment.
I fixed the patch to have the right order.

In this patch, grp is represented like "001.012.3124.0005".
Comment 7 Ryota Suzuki 2012-07-27 14:52:30 UTC
Created attachment 1344 [details]
patch (should be faster)
Comment 8 Ryota Suzuki 2012-07-27 14:59:10 UTC
(In reply to comment #6)
> (In reply to comment #4)
> > I don't think the patch will get the sort order right.
> 
> Thank you for your comment.
> I fixed the patch to have the right order.
> 
> In this patch, grp is represented like "001.012.3124.0005".

I noticed that this representation is bad since (the subset of) grp is reused for sorting. I fixed to use only the order of the above representation.

There should be better ways to map each combination of factors into a integer.
I just post this patch to make it work right externally and hope that someone makes it better.
Comment 9 Ryota Suzuki 2012-07-27 15:25:10 UTC
Created attachment 1345 [details]
better way to generate integer identifier

Finally I have found a better way.
In this patch grp is defined as the sort order of factors.
Comment 10 Ryota Suzuki 2012-07-27 15:32:21 UTC
I am sorry the last one does not work with ties. I will fix it.
Comment 11 Ryota Suzuki 2012-07-27 16:48:02 UTC
Created attachment 1346 [details]
patch using numbers-and-dots, corrected for ties
Comment 12 Ryota Suzuki 2012-07-27 16:49:08 UTC
Created attachment 1347 [details]
patch with minimal changes
Comment 13 Ryota Suzuki 2012-07-27 17:09:40 UTC
Thank you so much for your attention. I uploaded two patches, namely:

* aggregate.patch.4
* aggregate.patch.5

The details are as follows:

aggregate.patch.4 is the corrected version of numbers-and-dots representation.
It combines each combination of factors like "001.0234.56789" and takes its rank() to have an integer representation.

aggregate.patch.5 has the same strategy as the original one, but uses rank() to avoid explosion of grp values.
In addition I changed grp from double into integer to avoid comparison failure, and also changed as.factor() to factor(), to reduce nlevels(ind) if possible.

Both are not perfect, just quick fixes, but works fine even with the test code that the original version fails. I hope some fix are done to this problem, by any way.
Comment 14 Ryota Suzuki 2012-07-27 18:07:40 UTC
Created attachment 1348 [details]
patch using numbers-and-dots, corrected again

added scientific = FALSE option to format()
Comment 15 Ryota Suzuki 2012-07-27 18:09:05 UTC
aggregate.patch.4 has been revised. Added scientific = FALSE option to format().
Comment 16 Ryota Suzuki 2012-07-27 23:18:45 UTC
I also tried a more direct implementation:

    grp <- integer(nrx)
    dic <- split(1:nrx, by, drop = TRUE)
    for(i in 1:length(dic))
        grp[dic[[i]]] <- i - 1

But it was too slow. It took over 40 seconds with a dataset having over 4,000 combinations of factors, while it took less than three seconds via the former two patches (about 10% faster than the original one in this case). Numbers-and-dots implementation (patch 4) was slightly faster than minimal one (patch 5).

Just for information the original function returned only about 1,600 combinations of factors due to the comparison problem. There was no error message or warning so it was hard to detect what was wrong.

Thank you for your attention and I am sorry for too many posts.
Comment 17 Alexey Sergushichev 2013-09-29 05:24:56 UTC
Hey,

Why is Ryota's patch still not in the codebase? I stumbled on this bug and spend some time, try to figure why I'm losing rows in a data frame.
Comment 18 Duncan Murdoch 2013-09-29 13:29:01 UTC
(In reply to comment #17)
> Hey,
> 
> Why is Ryota's patch still not in the codebase? I stumbled on this bug and
> spend some time, try to figure why I'm losing rows in a data frame.

Which one?  I counted about 10 different patches in his messages.
Comment 19 Ryota Suzuki 2013-09-30 04:50:21 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > Hey,
> > 
> > Why is Ryota's patch still not in the codebase? I stumbled on this bug and
> > spend some time, try to figure why I'm losing rows in a data frame.
> 
> Which one?  I counted about 10 different patches in his messages.

aggregate.patch.4 worked best when I wrote them. I will check this patch with current codebase.

# Thank you Alexey for finding this patch.
Comment 20 Ryota Suzuki 2013-09-30 07:38:23 UTC
Created attachment 1492 [details]
patch for aggregate.R revision 63161

This is a patch for the current revision. I fixed the sort order to be identical to the existing implementation (rev() was needed to be identical to the current implementation).

I tested this code that:

* It returns the same result as the current version with small data
  (Using identical() with MASS::Boston data set)

* It passes the test code I've already posted
  (test.aggregate.data.frame.R, id = 1340)

Best Regards,
Ryota
Comment 21 Duncan Murdoch 2013-10-18 18:45:44 UTC
I'll be committing code to fix this in R-devel and R-patched soon.  It is based on Ryota's final patch, but with several changes:

 - That code did not handle NA properly.
 - I don't like calling .Internal from another package, even from one base package to another.
 - "fun" is not a very informative name.

Thanks to Ryota for your persistence in working on this.  If you do it again, please test against the output from running example() on the function; that's how I discovered the NA bug.  (It's also a good idea to run the regression tests in the R source on it.)
Comment 22 Ryota Suzuki 2013-10-21 12:43:02 UTC
Thank you Duncan, and Alexey too! I will pay more attention to testing next time.
Comment 23 Andrew Christianson 2014-07-22 23:06:09 UTC
*** Bug 15116 has been marked as a duplicate of this bug. ***