Bug 14262 - utf8ToInt slow!
Status: RESOLVED FIXED
Product: R
Classification: Unclassified
Component: Low-level
Version: R 2.11.0
Hardware: ix86 (32-bit) Windows 32-bit
Importance: P5 minor
Assigned To: R-core
Reported: 2010-04-19 15:25 UTC by Simon Carne
Modified: 2010-04-19 16:50 UTC (History)

Description Simon Carne 2010-04-19 15:25:36 UTC
In R --vanilla (R 2.11.0 RC (2010-04-14 r51736)):

a <- paste(rep("\u100",1e5),collapse = "")
system.time(utf8ToInt(a))

returns

   user  system elapsed 
   9.65    0.00    9.67

9 seconds on a modern computer to convert 100000 characters is a bit slow, isn't it? For comparison: intToUtf8 takes around 0.02 seconds to do the reverse operation.


sessionInfo() (not in --vanilla) returns

R version 2.11.0 RC (2010-04-14 r51736) 
i386-pc-mingw32 

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     


I am using Windows XP SP 2.
Comment 1 Simon Urbanek 2010-04-19 16:50:25 UTC
Addressed in r51773 for R-devel (too late for 2.11.0)

before:
> a <- paste(rep("\u100",1e5),collapse = "")
> system.time(utf8ToInt(a))
   user  system elapsed 
  0.470   0.005   0.474 

after:
> a <- paste(rep("\u100",1e5),collapse = "")
> system.time(utf8ToInt(a))
   user  system elapsed 
  0.001   0.000   0.000 

The slow-down was a function of the length of the string and the number of multibyte characters (each triggering a strlen()).
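
To illustrate why one strlen() per multibyte character gets expensive, here is a minimal sketch in C. It is not the code from R's sources or from r51773; the function names (decode_one, decode_rescan, decode_once) are hypothetical, and the decoder is simplified (it does not reject overlong encodings or surrogates). decode_rescan measures the remaining string again for every multibyte character, so its cost grows roughly with string length times the number of multibyte characters. decode_once measures the string a single time and is linear.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode one UTF-8 sequence at s, given `avail` readable bytes.
   Stores the code point in *cp and returns the bytes consumed,
   or 0 if the sequence is truncated or malformed.
   Simplified: overlong forms and surrogates are not rejected. */
static size_t decode_one(const unsigned char *s, size_t avail, unsigned int *cp)
{
    size_t need, i;
    unsigned int c;

    if (avail == 0) return 0;
    c = s[0];
    if (c < 0x80) { *cp = c; return 1; }              /* ASCII */
    else if ((c & 0xE0) == 0xC0) { need = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { need = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { need = 4; c &= 0x07; }
    else return 0;                                    /* invalid lead byte */

    if (avail < need) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;          /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3Fu);
    }
    *cp = c;
    return need;
}

/* Slow pattern: strlen() on the rest of the string for every
   multibyte character. `out` needs room for strlen(s) entries. */
size_t decode_rescan(const char *s, unsigned int *out)
{
    const unsigned char *p = (const unsigned char *) s;
    size_t n = 0;

    assert(s != NULL && out != NULL);
    while (*p) {
        /* Rescans everything up to the terminating NUL each time. */
        size_t avail = (*p < 0x80) ? 1 : strlen((const char *) p);
        size_t used = decode_one(p, avail, &out[n]);
        if (used == 0) break;
        p += used;
        n++;
    }
    return n;
}

/* Linear pattern: compute the end of the string once, then track
   the remaining byte count with pointer arithmetic. */
size_t decode_once(const char *s, unsigned int *out)
{
    const unsigned char *p = (const unsigned char *) s;
    const unsigned char *end;
    size_t n = 0;

    assert(s != NULL && out != NULL);
    end = p + strlen(s);                              /* single scan */
    while (p < end) {
        size_t used = decode_one(p, (size_t) (end - p), &out[n]);
        if (used == 0) break;
        p += used;
        n++;
    }
    return n;
}
```

Both functions return the number of code points written to out. The test string in this report, "\u100" repeated, is the two-byte sequence 0xC4 0x80 repeated, so every character takes the multibyte path and decode_rescan hits its worst case.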