Bug 16064 - Writing UTF8 on Windows
Status: CLOSED WONTFIX
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 3.1.2
Hardware: Other Windows 32-bit
Importance: P5 enhancement
Assignee: R-core
Reported: 2014-11-07 21:51 UTC by Jeroen Ooms
Modified: 2017-07-25 10:24 UTC

Description Jeroen Ooms 2014-11-07 21:51:21 UTC
See this post [1] on r-devel for more details. I think the following two examples indicate bugs.

1) When writing a UTF-8 string to a binary UTF-8 connection on Windows, R generates Latin-1 output:

> string <- enc2utf8("Zürich")
> Encoding(string)
[1] "UTF-8"

> con <- file("test1.txt", open="wb", encoding = "UTF-8")
> writeLines(string, con)
> close(con)
> system("file test1.txt")
test1.txt: ISO-8859 text


2) Another, probably related, problem: when writing a UTF-8 string to a UTF-8 text connection with useBytes=TRUE, the string seems to be re-encoded one time too many, resulting in invalid characters:

> con <- file("test3.txt", open="w", encoding = "UTF-8")
> writeLines(string, con, useBytes = TRUE)
> close(con)
> system("file test3.txt")
test3.txt: UTF-8 Unicode text, with CRLF line terminators
> readLines("test3.txt", encoding="UTF-8")
[1] "ZÃ¼rich"


[1] https://stat.ethz.ch/pipermail/r-devel/2014-October/069944.html
Comment 1 Duncan Murdoch 2017-05-22 14:26:16 UTC
The first is not a bug: re-encoding is only documented to work on text-mode connections, and it does appear to work on one of those. (This might not have been documented when the bug was first reported!)

The second is unclear.  You are asking the connection to translate the input into UTF-8, and you are telling writeLines that the string is already native.  So I'd say it might be reasonable to do what it did.  It's really a matter of priority, and it appears the request to translate had a higher priority than the request to leave things alone.

It's also arguable that the "useBytes" should have higher priority, but I'm not going to change this one.
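
The double conversion discussed for the second case can be reproduced on any platform without a connection. This is only a sketch: it assumes the connection treats the incoming bytes as Latin-1 (which agrees with the usual Western European Windows code page for these bytes) and uses iconv() to stand in for the connection's re-encoding step. The variable names are illustrative.

```r
# The UTF-8 bytes of "Zürich": 5a c3 bc 72 69 63 68
utf8_bytes <- charToRaw(enc2utf8("Z\u00fcrich"))

# Assumption: pretend those bytes are Latin-1 text and convert them to UTF-8
# once more, as a re-encoding connection might do with input that it
# believes is in the native encoding.
twice <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# The two bytes of the original u-umlaut (c3 bc) have each been expanded
# into a two-byte sequence, so the result is two bytes longer.
double_bytes <- charToRaw(twice)
print(double_bytes)
```

Read back as UTF-8, these bytes display as two separate accented characters where the single "ü" used to be, which matches the symptom in the report.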
Comment 2 Johannes Ranke 2017-07-25 10:24:19 UTC
In the first case, the desired result is obtained when using 

> writeLines(string, con, useBytes = TRUE)
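
A way to check this workaround without relying on the external file utility is to read the raw bytes back. The following is a minimal sketch, not a statement about what R-core recommends: it writes to a temporary file (an assumption made so the example cleans up after itself) and leaves out the encoding argument, since re-encoding is documented to apply only to text-mode connections.

```r
string <- enc2utf8("Z\u00fcrich")
path <- tempfile(fileext = ".txt")

# Binary-mode connection: no re-encoding and no CRLF translation.
con <- file(path, open = "wb")
# useBytes = TRUE asks writeLines to pass the string's bytes through as is.
writeLines(string, con, useBytes = TRUE)
close(con)

# Read the file back as raw bytes to see exactly what was written.
written <- readBin(path, what = "raw", n = file.info(path)$size)
print(written)
unlink(path)
```

If the bytes c3 bc appear where the "ü" should be, the file holds UTF-8 rather than Latin-1 (which would show a single fc byte).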

In the second case, shouldn't the encoding (here UTF-8) be a property of the connection, so that writeLines can check if it is consistent with the native encoding or not?