Bug 14286 - textConnection slow on large UTF-8 strings containing many multibyte characters
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Low-level
Version: R 2.12.0
Hardware: ix86 (32-bit) Windows 32-bit
Importance: P5 enhancement
Assigned To: R-core
Reported: 2010-05-06 16:30 UTC by Simon Carne
Modified: 2010-05-17 14:11 UTC (History)
Description Simon Carne 2010-05-06 16:30:25 UTC
textConnection takes much longer than I would expect on long UTF-8 strings. For example,
> system.time(textConnection(intToUtf8(rep(1e3,1e5)),encoding = "UTF-8"))
reports 10.13 seconds of user time on my system. intToUtf8 is not the problem: on its own it takes just 0.02 seconds. The multibyte characters appear to be the cause, because if I replace "rep(1e3,1e5)" with "rep(1e2,1e5)", so that every character is plain ASCII, the same call takes only 0.01 seconds.
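
The comparison can be reproduced with something like the following minimal sketch. It assumes a shorter string than in the report (n = 1e4) so that it finishes quickly on affected builds, and the helper name time_text_connection is hypothetical. Timings will differ by system, locale and R version, so no expected numbers are given.

```r
# Time textConnection() on one long string built from a repeated code point.
# `time_text_connection` is a hypothetical helper, not part of base R.
time_text_connection <- function(code_point, n = 1e4) {
  s <- intToUtf8(rep(code_point, n))  # a single string of n characters
  elapsed <- system.time({
    con <- textConnection(s, encoding = "UTF-8")
    close(con)                        # release the connection after timing
  })[["elapsed"]]
  elapsed
}

ascii_time     <- time_text_connection(100)   # U+0064 "d": 1 byte in UTF-8
multibyte_time <- time_text_connection(1000)  # U+03E8: 2 bytes in UTF-8

cat("ASCII:     ", ascii_time, "seconds\n")
cat("Multibyte: ", multibyte_time, "seconds\n")
```

Raising n towards 1e5 should bring the timings closer to the figures quoted above on an affected build.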

sessionInfo():
-- CUT HERE --
R version 2.12.0 Under development (unstable) (2010-04-30 r51867) 
i386-pc-mingw32 

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
-- CUT HERE --
Comment 1 Brian Ripley 2010-05-17 14:11:50 UTC
The issue is more that the string is very long, which is not what
textConnection was designed for.  It is impossible to guess what the
real application is, or whether 10 seconds is really slow.

Anyway, an unnecessary translation was being done, so it will be much
faster in 2.11.1.