Bug 14286 - textConnection slow on large UTF-8 strings containing many multibyte characters
Alias: None
Product: R
Classification: Unclassified
Component: Low-level
Version: R 2.12.0
Hardware: ix86 (32-bit) Windows 32-bit
Importance: P5 enhancement
Assignee: R-core
Depends on:
Reported: 2010-05-06 16:30 UTC by Simon Carne
Modified: 2010-05-17 14:11 UTC
Description Simon Carne 2010-05-06 16:30:25 UTC
textConnection seems to take much longer than I would expect on long UTF-8 strings. For example,
> system.time(textConnection(intToUtf8(rep(1e3,1e5)),encoding = "UTF-8"))
returns 10.13 seconds (user time) on my system. The intToUtf8 call is not the problem; it takes just 0.02 seconds. The multibyte UTF-8 characters appear to be the cause: if I replace "rep(1e3,1e5)" with "rep(1e2,1e5)", so that every character is plain ASCII, the operation takes only 0.01 seconds.
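
The comparison above can be sketched as a small script that builds both strings first and then times only the textConnection call. This is a minimal sketch, not part of the original report: the helper name time_tc and the variable names are made up for illustration, and the timings will vary with the R version and platform.

```r
# Build the two test strings up front so that intToUtf8() is not
# included in the timing of textConnection().
n <- 1e5
multibyte <- intToUtf8(rep(1000L, n))  # U+03E8, two bytes per character in UTF-8
ascii     <- intToUtf8(rep(100L, n))   # "d", one byte per character

# Hypothetical helper: time opening a text connection on x, then close it
# so the connection is not left open.
time_tc <- function(x) {
  elapsed <- system.time(con <- textConnection(x, encoding = "UTF-8"))
  close(con)
  elapsed[["elapsed"]]
}

t_multibyte <- time_tc(multibyte)
t_ascii     <- time_tc(ascii)

cat("multibyte:", t_multibyte, "s  ascii:", t_ascii, "s\n")
```

If the slowdown is tied to the multibyte characters rather than to string length alone, t_multibyte should be much larger than t_ascii on an affected build.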

-- CUT HERE --
R version 2.12.0 Under development (unstable) (2010-04-30 r51867) 

[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
-- CUT HERE --
Comment 1 Brian Ripley 2010-05-17 14:11:50 UTC
The issue is more that the string is very long, which is not what
textConnection was designed for.  It's impossible to guess what the real
application is, or whether 10 secs is really slow.

Anyway, an unnecessary translation was being done, so it will be much
faster in 2.11.1.