Bug 17329 - translateCharUTF8 broken on Windows
Summary: translateCharUTF8 broken on Windows
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Windows GUI / Window specific
Version: R 3.4.1
Hardware: Other Other
Importance: P5 minor
Assignee: R-core
Reported: 2017-08-24 17:31 UTC by Patrick Perry
Modified: 2017-11-28 09:29 UTC
Description Patrick Perry 2017-08-24 17:31:47 UTC
On Windows,

enc2utf8("ü")

yields "|".

It's telling that the UTF-16 representation of the input is 00 FC, and the
UTF-8 representation of the output is 7C.
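
For reference, a minimal sketch of what the UTF-8 bytes for "ü" (U+00FC) should be. The helper name utf8_encode_bmp is hypothetical and is not part of R's sources; it only covers code points up to U+FFFF. A correct conversion yields C3 BC, whereas the observed output 7C equals FC with the high bit cleared. That is an arithmetic observation only; the report does not establish the cause.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from R's sources): encode one code point in the
   Basic Multilingual Plane as UTF-8. Returns the number of bytes written
   to out (1 to 3), or 0 for surrogates and code points above U+FFFF. */
size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    if (cp < 0x80) {                      /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

Calling utf8_encode_bmp(0xFC, out) writes the two bytes C3 BC, and utf8_encode_bmp(0x7C, out) writes the single byte 7C, which is "|".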

I think that line 1001 of sysutils.c:
 inbuf = ans; inb = strlen(inbuf);

 (https://github.com/wch/r-source/blob/trunk/src/main/sysutils.c#L1001)


should be
 inbuf = ans; inb = LENGTH(x);

like the analogous line in do_iconv (https://github.com/wch/r-source/blob/trunk/src/main/sysutils.c#L680 ).
Comment 1 Patrick Perry 2017-08-24 17:32:41 UTC
More info: https://github.com/juliasilge/tidytext/issues/80
Comment 2 Patrick Perry 2017-08-24 21:41:29 UTC
Even more info:

https://github.com/patperry/r-corpus/issues/5


And a work-around implementation for translateCharUTF8:

https://github.com/patperry/r-corpus/blob/master/src/utf8.c#L755
Comment 3 Martin Maechler 2017-11-11 16:27:07 UTC
I'm not using Windows, but I believe you that there may be a bug.
You did not give your  `sessionInfo()`  with the locale setting.

Also, the proposed code change is not in a Windows-only place, so shouldn't it be possible to get buggy behavior platform-independently, say in a latin1 Linux locale?

Further, LENGTH(inbuf) is wrong "of course" because inbuf is a `char *`;
but maybe LENGTH(x) would be correct, as inbuf = ans and ans = CHAR(x).

But really, the "pattern" with 'strlen()' appears five times in the file sysutils.c:

translateToNative(*ans, *cbuff, ...) :
    875:top_of_loop:
       :    inbuf = ans; inb = strlen(inbuf);
translateCharUTF8(SEXP x) :
   1004:top_of_loop:
       :    inbuf = ans; inb = strlen(inbuf);
wtransChar(SEXP x) :
   1102:top_of_loop:
       :    inbuf = ans; inb = strlen(inbuf);
reEnc(char *x, ce_in, ce_out, subst) :
   1207:top_of_loop:
       :    inbuf = x; inb = strlen(inbuf);
reEnc2(char *x, char *y, ......) :
   1313:top_of_loop:
       :    inbuf = x; inb = strlen(inbuf);


whereas the pattern with LENGTH(.) appears only once...

but I'm babbling along. I'm not expert enough to understand these character translations, and I cannot easily fix a problem that is only seen on Windows.
Comment 4 Tomas Kalibera 2017-11-12 09:28:47 UTC
The bug report is unfortunately rather incomplete. It would be useful to have the output of sessionInfo() (to know the native encoding and the operating system) and to know which frontend was used and how it was run. Furthermore, a bug report should clearly describe the expected behavior and why it is expected. Looking at the source code presently in sysutils.c, I think the use of "strlen" is correct: the input string cannot include embedded zeros (note that it cannot be in UTF-16LE). I cannot reproduce the problem on my system, but that may not be surprising, as I may have a different setup. Furthermore, the linked discussion seems to suggest that this has been attributed to a bug in RStudio and that the bug has been fixed, so I am closing this. Please get back with a detailed report if there is still a problem.
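
The strlen argument above can be illustrated with a small sketch. The byte arrays and the helper strlen_matches are hypothetical and do not model R's CHARSXP internals; nbytes stands in for an explicitly stored byte count such as the one LENGTH(x) would report. For encodings without embedded zero bytes (latin1, UTF-8), strlen agrees with the byte count; for UTF-16 it would not.

```c
#include <stddef.h>
#include <string.h>

/* "ü" in several encodings, each followed by a terminating zero.
   These arrays are illustrative only; they are not taken from R. */
const char u_latin1[]  = { '\xFC', '\0' };                 /* 1 payload byte  */
const char u_utf8[]    = { '\xC3', '\xBC', '\0' };         /* 2 payload bytes */
const char u_utf16le[] = { '\xFC', '\x00', '\0', '\0' };   /* 2 payload bytes */
const char u_utf16be[] = { '\x00', '\xFC', '\0', '\0' };   /* 2 payload bytes */

/* Returns 1 if strlen() reports the same number of bytes as the explicit
   count nbytes, and 0 otherwise. strlen() stops at the first zero byte,
   so it undercounts whenever the payload itself contains a zero. */
int strlen_matches(const char *buf, size_t nbytes)
{
    return strlen(buf) == nbytes;
}
```

Here strlen_matches(u_latin1, 1) and strlen_matches(u_utf8, 2) hold, while strlen_matches(u_utf16le, 2) and strlen_matches(u_utf16be, 2) do not, because strlen sees 1 and 0 bytes respectively.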
Comment 5 Tomas Kalibera 2017-11-13 12:31:11 UTC
Alright, after reading the related discussions I think there is a bug, possibly the one you meant or a related one. It is reproducible in RGui on Windows 10 with code page 1252. The Euro sign has code 0x80 in cp1252, but this code maps to a non-printable character in ISO Latin-1. When converted by iconv to the current locale, the Euro sign is marked as encoded in "latin1" and hence becomes unprintable.

> x <- rawToChar(as.raw(0x80))
> Encoding(x)
[1] "unknown"
> x
[1] "€" # x has declared native encoding and is printed fine 

> xx <- iconv(x, "cp1252", "") # should convert to current locale
> Encoding(xx)
[1] "latin1" # encoding marked as "latin1"
> xx
[1] "\u0080" # xx has declared "latin1" encoding and hence treated unprintable

The problem may be caused by R on Windows internally setting latin1locale/known_to_be_latin1 to TRUE when the current code page is 1252 (in R_check_locale in platform.c).
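
The difference between the two encodings for byte 0x80 can be sketched as follows. The function names are hypothetical and the code is not taken from R; the table is the standard cp1252 assignment for bytes 0x80 to 0x9F, with 0 marking bytes that cp1252 leaves unassigned. Treating a cp1252 string as latin1 turns the Euro sign into the C1 control character U+0080.

```c
#include <stdint.h>

/* cp1252 code points for bytes 0x80..0x9F; 0 marks unassigned bytes. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

/* ISO Latin-1 maps every byte to the code point with the same value. */
uint32_t latin1_to_codepoint(unsigned char b)
{
    return b;
}

/* cp1252 agrees with Latin-1 except in the range 0x80..0x9F. */
uint32_t cp1252_to_codepoint(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_high[b - 0x80];
    return b;
}

/* C1 control characters (U+0080..U+009F) are non-printable. */
int is_c1_control(uint32_t cp)
{
    return cp >= 0x80 && cp <= 0x9F;
}
```

With these definitions, cp1252_to_codepoint(0x80) is U+20AC (the Euro sign), while latin1_to_codepoint(0x80) is U+0080, for which is_c1_control returns 1. Bytes outside 0x80..0x9F, such as 0xFC for "ü", map identically under both.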

> sessionInfo()
R Under development (unstable) (2017-11-13 r73718)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 15063)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.0
Comment 6 Tomas Kalibera 2017-11-28 09:29:19 UTC
The problem I have described has been addressed in 73783. If this did not resolve the originally reported issue, please file a new bug report (with the necessary information included).