Bug 17120 - path.expand() produces unexpected result with UTF-8 encoded strings on Unix
Summary: path.expand() produces unexpected result with UTF-8 encoded strings on Unix
Status: REOPENED
Alias: None
Product: R
Classification: Unclassified
Component: System-specific
Version: R-devel (trunk)
Hardware: Other Other
Importance: P5 minor
Assignee: R-core
Reported: 2016-07-15 18:36 UTC by Kevin Ushey
Modified: 2017-05-24 14:26 UTC

Description Kevin Ushey 2016-07-15 18:36:44 UTC
E.g.

> path <- "~/鬼.R"
> Encoding(path)
[1] "UTF-8"
> path.expand(path)
[1] "C:\\Users\\Kevin/<U+9B3C>.R"
> normalizePath(path)
[1] "C:\\Users\\Kevin\\<U+9B3C>.R"
Warning message:
In normalizePath(path.expand(path), winslash, mustWork) :
  path[1]="C:/Users/Kevin/<U+9B3C>.R": The filename, directory name, or volume label syntax is incorrect

Note that the UTF-8 character '鬼' was expanded into a text representation of its code point. Is this the expected behavior?

For what it's worth, marking the encoding as 'unknown', applying these transformations, and then re-marking the result as 'UTF-8' produces the expected result.
Comment 1 Duncan Murdoch 2016-07-15 19:44:45 UTC
Your title isn't quite right: there are no "UTF-8" paths on Windows, only "UTF-16" paths. Those are different encodings for the same set of Unicode characters. If Windows had a UTF-8 code page, everything would just work, but it doesn't, so R converts from UTF-16 to the local code page (for me that's CP-1252, which is nearly equivalent to Latin1; I'm not sure which one you're using), and characters that don't exist in that code page are written as escapes.
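
As an illustration of the escaping described above, here is a minimal sketch in C. It is not R's actual conversion code: it assumes every non-ASCII character is unrepresentable in the target code page (stricter than a real CP-1252 conversion, which can represent some of them), and the helper names decode_utf8 and escape_non_ascii are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not R's code: decode one UTF-8 sequence starting at s,
   store the code point in *cp, and return the number of bytes consumed
   (0 on malformed input). Overlong encodings are not rejected. */
static int decode_utf8(const unsigned char *s, unsigned int *cp)
{
    int n, i;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;
    for (i = 1; i < n; i++) {
        /* A continuation byte must look like 10xxxxxx; this also stops
           at a premature NUL terminator. */
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Copy `in` (assumed UTF-8) to `out`, writing each non-ASCII character as
   <U+XXXX>. Returns 0 on success, -1 if `out` is too small or the input
   is malformed. */
static int escape_non_ascii(const char *in, char *out, size_t outsize)
{
    const unsigned char *s = (const unsigned char *) in;
    size_t used = 0;
    if (outsize == 0) return -1;
    out[0] = '\0';
    while (*s) {
        unsigned int cp;
        int n = decode_utf8(s, &cp);
        int w;
        if (n == 0) return -1;
        if (cp < 0x80)
            w = snprintf(out + used, outsize - used, "%c", (int) cp);
        else
            w = snprintf(out + used, outsize - used, "<U+%04X>", cp);
        if (w < 0 || (size_t) w >= outsize - used) return -1;
        used += (size_t) w;
        s += n;
    }
    return 0;
}
```

Given the UTF-8 bytes of "~/鬼.R" (the character is E9 AC BC), this sketch yields "~/<U+9B3C>.R", the same escape that appears in the report.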

One solution to this would be to use UTF-8 internally in R even when it doesn't match the local code page, and skip the transition through the local 8 bit encoding.  That wasn't practical when Brian Ripley added the internationalization many years ago, and it would be a major effort to do it now.  So far nobody has volunteered to do the work.  Another solution would be for Microsoft to add a UTF-8 code page to Windows.  

The simplest workaround for you would be to stop using Windows.
Comment 2 Duncan Murdoch 2016-07-15 19:52:33 UTC
Actually it turns out that Windows does have a UTF-8 code page, it's CP 65001.  I don't have a modern Windows system at hand right now to try it out, but you can if you like.
Comment 3 Kevin Ushey 2016-07-15 23:05:33 UTC
Hi Duncan,

Thanks for the response -- indeed, I'm using an English locale (CP1252); sorry for not posting that in my initial report. Unfortunately, I didn't have much luck trying to use a UTF-8 code page.

I agree that attempting to use UTF-8 internally in R wherever possible (rather than the system encoding) would be preferred, but understand that would be a very large undertaking. (Perhaps it would be worth proposing as a project for the R Consortium?)

For what it's worth, the issue with 'path.expand()' could potentially be resolved by avoiding the translation to the system encoding. I.e., currently paths are expanded as (from https://github.com/wch/r-source/blob/trunk/src/main/platform.c#L1839-L1860):

    R_ExpandFileName(translateChar(...))

where 'translateChar()' attempts to translate from some marked encoding to the system / native encoding. Unfortunately, that translation fails with UTF-8 characters not representable in the active locale.

It seems like this translation could potentially be avoided, since all we really want to do in 'path.expand()' is substitute a leading '~' with the value of 'home', which is pulled out of some environment variable:

https://github.com/wch/r-source/blob/trunk/src/gnuwin32/sys-win32.c#L67-L102

(although perhaps the paths encoded in those environment variables need to be assumed to be in the system encoding, and so converted to UTF-8 before concatenating with the original path)
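
A minimal sketch of that idea in C, under stated assumptions: this is not R's R_ExpandFileName and not the eventual patch, the function name expand_tilde_bytes is hypothetical, and `home` is assumed to have already been converted to the same encoding as `path` (e.g. both UTF-8). The point is only that the leading '~' can be substituted by byte concatenation, so the rest of the path never passes through the native encoding.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: replace a leading "~" in `path` with `home` using
   plain byte concatenation. No encoding translation is applied, so bytes
   after the "~" are copied through unchanged.
   Returns 0 on success, -1 if the result does not fit in `out`. */
static int expand_tilde_bytes(const char *path, const char *home,
                              char *out, size_t outsize)
{
    int w;
    /* Only "~" alone, "~/..." or "~\..." is expanded; any other form
       (including "~user") is copied as-is in this sketch. */
    if (path[0] == '~' &&
        (path[1] == '\0' || path[1] == '/' || path[1] == '\\'))
        w = snprintf(out, outsize, "%s%s", home, path + 1);
    else
        w = snprintf(out, outsize, "%s", path);
    return (w < 0 || (size_t) w >= outsize) ? -1 : 0;
}
```

With path "~/鬼.R" in UTF-8 and home "C:\Users\Kevin", the output keeps the original three bytes of '鬼' intact; the caller would still need to mark the result as UTF-8.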

This seems like it could be a simple, safe, localized change -- I'll see if I can produce a patch that accomplishes this.

---

> sessionInfo()
R version 3.3.1 Patched (2016-06-27 r70840)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252
Comment 4 Duncan Murdoch 2017-05-23 10:54:14 UTC
This is now fixed in R-devel (as of r72719).  The fix is for Windows only; I don't know if other non-UTF8 locales should get something similar.
Comment 5 Duncan Murdoch 2017-05-24 13:55:38 UTC
This problem probably still exists on Unix.  (The behaviour exists when running R in a C locale at least, and probably in other non-UTF8 locales.)

My fix applied to Windows only, so I've re-opened the bug.  I don't know what the appropriate fix would be on Unix:  it may depend on the user's locale.

I'll condition the regression test to run on Windows only for now.
Comment 6 Duncan Murdoch 2017-05-24 14:26:02 UTC
Changed title to reflect current status.