Bug 15697 - improve partial matching performance for large x / table
Summary: improve partial matching performance for large x / table
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Low-level (show other bugs)
Version: R-devel (trunk)
Hardware: Other Linux
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2014-03-03 13:49 UTC by Martin Morgan
Modified: 2014-05-19 14:05 UTC (History)
0 users

See Also:


Attachments
use hashing more often in pmatch (2.75 KB, patch)
2014-03-03 13:49 UTC, Martin Morgan
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Morgan 2014-03-03 13:49:04 UTC
Created attachment 1575 [details]
use hashing more often in pmatch

Sub-setting a large data.frame by row name can be surprisingly slow

    table <- as.character(seq_len(485512))
    x <- sample(table, 10042)
    df <- data.frame(row.names=table)
    system.time(df[x,])

leading to

> system.time(df[x,])
   user  system elapsed
 18.797   0.000  18.837

This is because of partial matching

    system.time(i <- pmatch(x, table, duplicates.ok=TRUE))

> system.time(i <- pmatch(x, table, duplicates.ok=TRUE))
   user  system elapsed
 18.781   0.004  18.826

and an overly conservative decision to avoid hashing by using a polynomial algorithm in src/main/unique.c:1046. Concerns about memory use seems out-of-place.

$ svn diff ~/src/R-devel/src/main/unique.c
Index: /home/mtmorgan/src/R-devel/src/main/unique.c
===================================================================
--- /home/mtmorgan/src/R-devel/src/main/unique.c	(revision 65064)
+++ /home/mtmorgan/src/R-devel/src/main/unique.c	(working copy)
@@ -1043,7 +1043,7 @@
 	   since the tradeoff involves memory as well as time
 	   it is not really possible to optimize there.
 	 */
-	if((n_target > 100) && (10*n_input > n_target)) {
+	if(n_input > 100 || n_target > 100) {
 	    /* <FIXME>
 	       Currently no hashing when using bytes.
 	       </FIXME>

helps a lot

> system.time(df[x,])
   user  system elapsed
  0.048   0.004   0.051
> system.time(i <- pmatch(x, table, duplicates.ok=TRUE))
   user  system elapsed
  0.048   0.000   0.048

The attached patch hashes in the no_dups case, too. 'make check' passes without error. The bounds for hashing have been chosen arbitrarily.
Comment 1 Brian Ripley 2014-05-19 14:05:55 UTC
The concerns about memory use were most certainly not 'out of place' when this was implemented (in the days of 8MB RAM).

Changed for R-devel.