Created attachment 1901 [details]
The attached patch for "R Under development (unstable) (2015-08-31 r69232)" improves the speed of 'str(x)' (str.default) when 'x' is a character vector with long strings. The improvement is due to calling encodeString() on substrings instead of whole strings. There is a small performance penalty for short strings. Some of it could be avoided if encodeString() reliably handled non-printable characters on all platforms and locales, but that may be a difficult goal to achieve. One of the problem characters on the test platform is 'intToUtf8(8203)', the zero-width space.
The patch is not an improvement for some (possibly rare) cases where 'encodeString(x)' passes non-printable characters through unchanged, and 'nchar(encodeString(x), type = "width")' considers some of the characters to have zero or negative width, which causes the result to appear narrower than required. As a fallback, the patched code will then discard the encodeString() result of the substring and encode the whole string instead, like the existing implementation.
The patch also has a small change to the code that determines the number of elements to be shown: the number of characters (type = "width") in a double quoted encoded string is counted as 2 (the quotes) or more. Normally this makes no difference, but the existing implementation may choose to display a larger number of elements when nchar() gets confused by non-printable characters and returns a number smaller than 2.
A patched R including the new str.R (referred to as str2.R in the patch file) passes all tests in 'make check'.
R Under development (unstable) (2015-08-31 r69232)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
 LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 LC_PAPER=en_US.UTF-8 LC_NAME=C
 LC_ADDRESS=C LC_TELEPHONE=C
 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
 compiler stats graphics grDevices utils datasets methods
other attached packages:
 microbenchmark_1.4-2 multcomp_1.4-1 TH.data_1.0-6
 survival_2.38-3 mvtnorm_1.0-3
loaded via a namespace (and not attached):
 Rcpp_0.12.0 lattice_0.20-33 codetools_0.2-14 zoo_1.7-12
 digest_0.6.8 MASS_7.3-44 grid_3.3.0 plyr_1.8.3
 gtable_0.1.2 magrittr_1.5 scales_0.3.0 ggplot2_1.0.1
 stringi_0.5-5 reshape2_1.4.1 sandwich_2.3-3 proto_0.3-10
 splines_3.3.0 tools_3.3.0 stringr_1.0.0 munsell_0.4.2
Created attachment 1902 [details]
Benchmark the original str.default() vs the patched version and plot the results. The code can load precomputed results from an .RData file.
Created attachment 1903 [details]
Precomputed test results (i5-3470 CPU)
This file is compatible with the previously attached test code.
Created attachment 1904 [details]
Boxplot by test code (i5-3470 CPU)
The results are presented as comparable pairs, reading from left to right: original (input 1), patched (input 1), original (input 2), patched (input 2), etc. The clarity suffers because of the wide scale in the results.
The left half shows results for inputs with 1 string. The inputs used in the right half have 26 (identical) strings. The number of characters is increasing from left to right, in each half: 1, 26, 2600, 260000, 26000000.
As an extreme example, the median (100 repeats) runtime in the 26 * 26000000 case decreases from 278 seconds to less than 1 second.
Created attachment 1905 [details]
Violin plot by test code (i5-3470)
This picture presents a subset of the results, concentrating on the cases where the patched str.default() is slower than the original. The increase percentages are median vs median. As the timings are in the sub-millisecond scale, the performance loss is quite insignificant.
Created attachment 1906 [details]
Updated test code
Previous version was missing 'library(vioplot)'.
Created attachment 1907 [details]
The original patch had a mistake in the branch that handles the case where the "vec.len" argument has been used.
Created attachment 1908 [details]
Boxplot with explicit vec.len=4 (i5-3470)
str.default() detects if the "vec.len" argument has been specified. This plot was done by modifying the previously attached test code so that 'vec.len = 4' is used in each function call compared by microbenchmark(). As a highlight, the median runtime of the 26 * 26000000 case dropped from 109 to 0.13 seconds, original vs. patched.
The options affecting str.default() were at their default values when running these and previous tests:
Created attachment 1909 [details]
Precomputed test results, vec.len=4 (i5-3470)
The results corresponding to the previous boxplot, with an explicit 'vec.len = 4' used in the calls to str.default().
Moving from the original str.default() to the patched version, we see the following relative changes in the median runtime when looking at the input sizes selected for the earlier violin plot: +9%, +9%, +10%, -25%. When the patched version is slower, it is not too much slower.
Thank you for your careful proposal including checks etc.
We are first looking into the other bogous behavior, you've mentioned about
'nchar(intToUtf8(8203), "width")' to be negative.
That may (or may not) simplify the patch needed for the speedup of str.default().
Thank you indeed,
(In reply to Martin Maechler from comment #9)
> Thank you for your careful proposal including checks etc.
> We are first looking into the other bogous behavior, you've mentioned about
> 'nchar(intToUtf8(8203), "width")' to be negative.
> That may (or may not) simplify the patch needed for the speedup of
> Thank you indeed,
Time has passed and the nchar() problem has indeed been solved a while ago.
I still keep a version of your patch ("pretty edited", some extra comments, also in line with the small str() changes since...) in my sources, and I would like to close this by really implementing such a speedup for such large strings.
In the mean time, str() and R too has improved a bit, but current str.default()
is still very slow for large strings and much faster after your patch.
I'll attach my current version of test_str.R and of str.default.2.R
so you (and other readers) can work with it.
Currently I'd hope for a much more concise patch with basically the same effect,
as it is a bit problematic to add much R code just for dealing with this rare extreme situation.
Created attachment 2099 [details]
Zip of str.default.2.R, test_str.R & str_bench.rds
Updated 'test_str.R' (leave away the longest lasting case)
updated str.default.2.R ( = str.default *my* version)
precomputed microbenchmarks 'str_bench.rds'
Created attachment 2100 [details]
Trivial changes to str.default.2.R
This attachment is a patch to make str.default.2.R a tiny bit more concise (removing an unnecessary assignment, etc.), but I am not able to come up with more fundamental changes at this point.
It's nice that the negative width issue has been solved. However, it is probably impossible to avoid checking the display width, i.e. nchar(type = "w"), of the encoded string, and then possibly resorting to encoding the whole string. A small scale (but extreme) example:
> multi_diacritics <- paste0("a", intToUtf8(768:783), "b")
> nchar(multi_diacritics, "width")
> nchar(substr(multi_diacritics, 1, 17), "width")
> encodeString(multi_diacritics) == multi_diacritics
Here we stack multiple diacritics on the letter "a", followed by the letter "b". Any proper non-empty prefix of 'multi_diacritics', up to substr(multi_diacritics, 1, 17), has a display width of 1. The same is true for the corresponding encoded version. In this case, the whole string is needed for a result with a display width of 2. The way encodeString() works here (it does nothing) seems correct because diacritics are printable characters.
I have now committed (to R-devel) a version of str.default() which *is* fast,
and seems to need less "slack" (could probably be decreased even more?), hopefully working fine also in the case of diacriticals or other mult-byte characters, by using 'strtrim(C, n)' instead of 'substr(C, 1, n)'.
r70698 | maechler | 2016-06-02 12:01:11 +0200 (Thu, 02 Jun 2016) | 1 line
str() on large strings is much faster now and string truncation is more accurate
My current plan is to also port it to 'R patched', possibly in time for R 3.3.1,
or then a bit later.
Thank you Mikko for the suggestions, code and test file proposals!
I'm closing the bug now, as the *speed* problem is solved.
If there are still problems with multibyte chars, I'd want you to open a new bug report.