Bug 16351 - R hangs when plotting invalid hexadecimal UTF-8 byte sequences
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: Graphics
Version: R 3.2.0
Hardware: x86_64/x64/amd64 (64-bit) Linux-Ubuntu
Importance: P5 normal
Assignee: R-core
Reported: 2015-04-29 09:45 UTC by Sebastian Meyer
Modified: 2017-03-03 11:32 UTC

Suggested patch to reEnc() (1.12 KB, patch)
2017-03-03 11:32 UTC, Mikko Korpela

Description Sebastian Meyer 2015-04-29 09:45:03 UTC
The following code causes my interactive R session to hang with 100% CPU usage:

plot(0, 0, type = "n")
text(0, 0, labels = "M\xc3a")

The same is true if I use the x11(type = "Xlib"), pdf(), or png() device.

The following warning is displayed _once_ before the session hangs:
Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()

Interestingly, the following plots work:

plot(0, 0, type = "n"); text(0, 0, labels = "M\xc3")

plot(0, 0, type = "n", xlab = "M\xc3a")

In these cases, the Pango warning stated above is displayed _twice_.

Of course, plotting works fine when using a valid hexadecimal UTF-8 byte sequence:

plot(0, 0, type = "n"); text(0, 0, labels = "M\xc3\xbc")
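
The three labels used above can be told apart before plotting by checking their bytes. The snippet below is only a sketch: it assumes validUTF8(), which is available from R 3.3.0 onwards (so not in the R 3.2.0 session reported here), and the variable names are made up for illustration.

```r
# The three labels from this report. "\xc3" is a UTF-8 lead byte that
# must be followed by a continuation byte in the range 0x80-0xBF.
labels <- c("M\xc3a", "M\xc3", "M\xc3\xbc")

# validUTF8() inspects the raw bytes, regardless of the declared encoding.
ok <- validUTF8(labels)
print(ok)

# Keep only the validly encoded labels (illustrative filter; a real
# workaround might prefer to repair the strings instead of dropping them).
safe_labels <- labels[ok]
```

Only the last label passes the check. The first is the one that triggers the hang, and the second is the truncated sequence that plots with two Pango warnings.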


R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

 [1] LC_CTYPE=de_CH.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=de_CH.UTF-8        LC_COLLATE=de_CH.UTF-8    
 [7] LC_PAPER=de_CH.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
Comment 1 Mikko Korpela 2017-03-03 11:32:16 UTC
Created attachment 2233
Suggested patch to reEnc()

I modified reEnc() so that it doesn't always return the original string unchanged when asked for the theoretically trivial "UTF-8" to "UTF-8" conversion. Instead, the original string is only returned if it is validly encoded. Otherwise, invalid bytes are marked as in other conversions. The suggested patch is attached.

This seems to fix the issue. Checking the validity of the string will obviously take some time. I don't know of any other negative side effects, but I'm not really familiar with how reEnc() is used, either inside R itself or externally in extension packages.
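
For illustration only, the behaviour described above (return validly encoded strings unchanged, otherwise mark the invalid bytes) can be approximated at the R level. This is not the attached patch, which changes the C function reEnc(). The helper name sanitize_utf8 is hypothetical, and the sketch assumes validUTF8() (R >= 3.3.0) and an iconv() implementation that accepts a "UTF-8" to "UTF-8" conversion.

```r
# Hypothetical R-level analogue of the patched behaviour; the real
# change is in the C function reEnc() and is not reproduced here.
sanitize_utf8 <- function(x) {
  bad <- !validUTF8(x)
  # sub = "byte" makes iconv() replace each non-convertible byte
  # by its hex code in angle brackets, <xx>.
  x[bad] <- iconv(x[bad], from = "UTF-8", to = "UTF-8", sub = "byte")
  x
}

# First element is the label that triggers the hang, second is valid.
sanitized <- sanitize_utf8(c("M\xc3a", "M\xc3\xbc"))
print(sanitized)
```

Validly encoded input is returned untouched, so the validity check is the only extra cost for well-formed strings. The exact replacement text for invalid bytes may depend on the platform's iconv.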

> sessionInfo()
R Under development (unstable) (2017-03-02 r72298)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/mvkorpel/root_R-devel-r72298-readline-reEnc/lib/R/lib/libRblas.so
LAPACK: /home/mvkorpel/root_R-devel-r72298-readline-reEnc/lib/R/lib/libRlapack.so

 [1] LC_CTYPE=fi_FI.UTF-8    LC_NUMERIC=C            LC_TIME=en_GB          
 [4] LC_COLLATE=en_GB        LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_GB      
 [7] LC_PAPER=en_GB          LC_NAME=C               LC_ADDRESS=C           

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.0