Bug 16098 - Windows doesn't handle high Unicode code points
Summary: Windows doesn't handle high Unicode code points
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Low-level
Version: R 3.1.2
Hardware: Other Windows 64-bit
Importance: P5 enhancement
Assignee: Duncan Murdoch
URL:
Duplicates: 17299 17305
Depends on:
Blocks:
 
Reported: 2014-12-04 21:27 UTC by Duncan Murdoch
Modified: 2017-07-07 12:57 UTC (History)

Description Duncan Murdoch 2014-12-04 21:27:36 UTC
On Windows,

 as.hexmode(utf8ToInt("\U1d4d0"))

returns

[1] "d4d0"

because the parser stores the value 0x1d4d0 in a wchar_t variable, which is only 16 bits wide on Windows, so the high bits are dropped and only 0xd4d0 remains.
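
The truncation can be reproduced outside R. The following is a minimal sketch, not R's parser code: `store_in_16_bits` is a hypothetical helper, and `uint16_t` stands in for the 16-bit Windows `wchar_t` so that the behaviour is the same on any platform.

```c
#include <stdint.h>

/* Hypothetical helper: mimics assigning a code point to a 16-bit
   wchar_t (as on Windows). uint16_t is used instead of wchar_t so the
   sketch does not depend on the platform's wchar_t width. */
static uint16_t store_in_16_bits(uint32_t code_point)
{
    /* The conversion keeps only the low 16 bits of the value. */
    return (uint16_t)code_point;
}
```

Under these assumptions, `store_in_16_bits(0x1d4d0)` yields `0xd4d0`, which matches the value reported above.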
Comment 1 Richard Cotton 2014-12-07 17:18:26 UTC
This behaviour is mentioned in r-lang 10.3.1:
http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Literal-constants

> \Unnnnnnnn \U{nnnnnnnn}
>     (where multibyte locales are supported and not on Windows, otherwise an error)

but not in ?Quotes, so an initial fix may be to simply update the documentation on that page.
Comment 2 Duncan Murdoch 2017-05-19 21:10:40 UTC
This is now fixed in R-devel.  It introduces a new type Rwchar_t, which is guaranteed to hold all Unicode code points, and code to work with surrogate pairs when they appear.  This is used on all systems, in case UTF-16 values slip in on other systems.  (The values used as surrogate pairs are not legal values in the UCS-2 or UCS-4 encoding, but it seems prudent to be able to handle them.)
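
For readers unfamiliar with surrogate pairs, the arithmetic involved can be sketched as follows. This is an illustration of the standard UTF-16 rules only, not the code committed to R-devel: `codepoint_t` is a hypothetical stand-in for a type wide enough to hold any code point (the role Rwchar_t plays in the fix), and all function names are made up for this example.

```c
#include <stdint.h>

/* Hypothetical stand-in for a type that can hold every Unicode code point. */
typedef uint32_t codepoint_t;

/* UTF-16 reserves 0xD800..0xDBFF for high and 0xDC00..0xDFFF for low surrogates. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low surrogate pair into one code point above 0xFFFF. */
static codepoint_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u
         + (((codepoint_t)(high - 0xD800u)) << 10)
         + (codepoint_t)(low - 0xDC00u);
}

/* Split a code point in 0x10000..0x10FFFF into its surrogate pair. */
static void split_code_point(codepoint_t cp, uint16_t *high, uint16_t *low)
{
    cp -= 0x10000u;                             /* leaves a 20-bit value */
    *high = (uint16_t)(0xD800u + (cp >> 10));   /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu)); /* bottom 10 bits */
}
```

With these definitions, the code point from the original report, 0x1d4d0, corresponds to the pair 0xD835, 0xDCD0, and combining that pair gives back 0x1d4d0.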
Comment 3 Duncan Murdoch 2017-06-27 11:08:16 UTC
*** Bug 17299 has been marked as a duplicate of this bug. ***
Comment 4 Duncan Murdoch 2017-07-07 12:57:03 UTC
*** Bug 17305 has been marked as a duplicate of this bug. ***