Bug 16819 - parse() does not handle Unicode characters in input
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.2.4
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 normal
Assignee: R-core
Reported: 2016-04-12 02:49 UTC by Pavel Minaev
Modified: 2016-04-13 10:51 UTC

Description Pavel Minaev 2016-04-12 02:49:48 UTC
Consider:

> parse(text = '"\\u0424"')[[1]]
[1] "Ф"

(this is Cyrillic letter "f")

This is the expected result. Now do the same, but with the string literal containing the Unicode character directly instead of a Unicode escape:

> parse(text = '"\u0424"')[[1]]
[1] "<U+0424>"

The expected result was the same as above, but parse() produced a <U+0424> escape sequence instead.

The input string is marked as UTF-8:

> Encoding('"\u0424"')
[1] "UTF-8"

so the expectation was that it would be parsed as UTF-8. In practice, however, the <U+XXXX> escapes seem to be produced for characters that cannot be represented in the current native encoding. If the locale is changed to, e.g., Russian, the result is correct:

> Sys.setlocale('LC_CTYPE', 'Russian_Russia.1251')
> parse(text = '"\u0424"')[[1]]
[1] "Ф"
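
The dependence on the native encoding can be checked without involving parse() at all. The following is only a sketch: representable_natively is a hypothetical helper name, not part of R, and it assumes that iconv() with to = "" targets the current locale's encoding and returns NA for input it cannot convert (the documented default for sub).

```r
# Sketch: does a UTF-8 string survive conversion to the native encoding?
# representable_natively() is a hypothetical helper, not an R function.
representable_natively <- function(s) {
  # to = "" means "the encoding of the current locale";
  # with the default sub = NA, unconvertible input yields NA.
  !is.na(iconv(enc2utf8(s), from = "UTF-8", to = ""))
}

representable_natively("abc")     # plain ASCII input
representable_natively("\u0424")  # result depends on the current locale
```

If the second call returns FALSE, the character is presumably one of those for which parse() emits a <U+XXXX> escape.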

Specifying encoding = 'UTF-8' in the calls to parse() above doesn't change any of the results.

Overall, the expectation is that if the input string to parse() is valid UTF-8, it should be parsed as such, with no escaping taking place. As it stands, eval(parse(text = ...)) is unreliable whenever the input can contain characters not representable in the current native encoding.
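
Since the first example shows that \u escapes are parsed correctly, one possible workaround is to rewrite non-ASCII characters as escapes before calling parse(). This is a rough sketch, not a proposed fix: escape_non_ascii is a hypothetical helper, and it assumes that every non-ASCII character in the input sits inside an ordinary quoted string literal (escapes are not valid in bare identifiers).

```r
# Sketch of a workaround: turn each non-ASCII code point into a \uXXXX
# (or \UXXXXXXXX) escape so the text passed to parse() is pure ASCII.
# escape_non_ascii() is a hypothetical helper; it expects a single string
# and assumes non-ASCII characters occur only inside quoted string literals.
escape_non_ascii <- function(s) {
  cps <- utf8ToInt(enc2utf8(s))          # code points of the input
  pieces <- vapply(cps, function(cp) {
    if (cp < 128L) {
      intToUtf8(cp)                      # ASCII is kept as-is
    } else if (cp <= 0xFFFF) {
      sprintf("\\u%04X", cp)             # BMP: four hex digits
    } else {
      sprintf("\\U%08X", cp)             # beyond the BMP: eight hex digits
    }
  }, character(1))
  paste(pieces, collapse = "")
}

src <- '"\u0424"'
escaped <- escape_non_ascii(src)         # ASCII-only source text
parse(text = escaped)[[1]]
```

The design trade-off is that the source text is altered before parsing, so anything that relies on the original text (srcrefs, deparsed code) will see the escapes rather than the characters.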
Comment 1 Duncan Murdoch 2016-04-12 10:34:22 UTC
This is due to conversion to the native encoding.  It probably only affects Windows, which doesn't support UTF-8 as a native encoding.  It has been reported before in other contexts, e.g. in bug 16543.