Bug 16737 - File connections write UTF-16 incorrectly in Windows
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 3.2.3
Hardware: Other Other
Importance: P5 enhancement
Assignee: R-core
 
Reported: 2016-02-29 18:46 UTC by Duncan Murdoch
Modified: 2016-02-29 18:46 UTC

Description Duncan Murdoch 2016-02-29 18:46:05 UTC
This bug report follows discussion in the R-devel mailing list starting with this post:

https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html

In summary, writing a file with

x <- data.frame(a = I("a \" quote"), b = pi)
write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")

produced bad results.  Prior to R-devel revision 70247 there was an issue with
strings being truncated at null bytes, but that has been fixed.  However, there
are still problems on Windows: because the file is written in text mode, Windows
inserts a single-byte \r before each newline byte, which breaks the two-byte
alignment of the UTF-16 code units and leads to an invalid file.
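
To make the failure concrete, here is a minimal sketch that simulates the text-mode translation on a raw vector, so it runs on any platform.  The helper names simulate_text_mode() and code_units() are hypothetical and exist only for this illustration; they are not part of R.

```r
# UTF-16LE bytes for the two-line text "a\nb\n" (each character is two bytes).
utf16le <- as.raw(c(0x61, 0x00, 0x0a, 0x00, 0x62, 0x00, 0x0a, 0x00))

# Hypothetical helper: mimic text mode by inserting a single 0x0d byte
# before every 0x0a byte, with no knowledge of the encoding.
simulate_text_mode <- function(bytes) {
  is_lf <- bytes == as.raw(0x0a)
  out <- raw(length(bytes) + sum(is_lf))
  pos <- seq_along(bytes) + cumsum(is_lf)  # new position of each original byte
  out[pos] <- bytes
  out[pos[is_lf] - 1L] <- as.raw(0x0d)     # the inserted single-byte \r
  out
}

# Hypothetical helper: read consecutive byte pairs as little-endian code units.
code_units <- function(bytes) {
  lo <- as.integer(bytes[seq(1, length(bytes), by = 2)])
  hi <- as.integer(bytes[seq(2, length(bytes), by = 2)])
  lo + 256L * hi
}

mangled <- simulate_text_mode(utf16le)
print(mangled)
print(sprintf("%04X", code_units(mangled)))
```

After the first inserted byte, every code unit is shifted by one byte, so a reader pairing bytes as UTF-16LE no longer sees "b" or the newlines.  With a single newline in the input the byte count would become odd, which is not a valid UTF-16 length at all.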

There appear to be two approaches for a solution on Windows.  First, we could tell Windows the encoding as part of the mode argument when the output file is opened.  Then it would insert the correct two-byte \r character.

An alternative requires a bigger change, but I think would be better:  we could handle the \r insertions ourselves, rather than telling Windows to do it.  This would have the advantage that we would not be restricted to the limited set of encodings that Windows text mode can handle (UNICODE, UTF-8, and UTF-16LE).  If all text file handling were done in R, we could make it easier for both Unix and Windows to write text files with line endings in either format.  This would require adding an "eol" argument to a number of functions, e.g. to file().
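
As a rough user-level illustration of the second approach, the sketch below builds the line endings in R, converts the whole text to UTF-16LE, and writes the bytes through a binary connection so the operating system does no translation.  The function write_lines_utf16le() and its "eol" argument are hypothetical names for this example only; this is not the proposed change to file() itself.

```r
# Hypothetical helper: R chooses the line ending, then writes raw bytes.
write_lines_utf16le <- function(lines, path, eol = "\r\n") {
  # Append the requested line ending to every line before encoding.
  text <- paste0(enc2utf8(lines), eol, collapse = "")
  # iconv(toRaw = TRUE) returns a list with one raw vector per input string.
  bytes <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  con <- file(path, open = "wb")  # binary mode: no text-mode translation
  on.exit(close(con))
  writeBin(bytes, con)
  invisible(bytes)
}

path <- tempfile(fileext = ".txt")
write_lines_utf16le(c("a", "b"), path)  # Windows-style endings
on_disk <- readBin(path, what = "raw", n = file.size(path))
print(on_disk)
```

Because the \r is added before the conversion, it is encoded as a full two-byte code unit like every other character, and passing eol = "\n" would give Unix-style endings on either platform.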