Bug 15762 - readLines() corrupts UTF-8 characters on Windows with open connections
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 3.1.0
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 normal
Assignee: R-core
Reported: 2014-04-19 09:55 UTC by Milan Bouchet-Valat
Modified: 2014-04-21 19:33 UTC
Attachments
Russian UTF-8 text to reproduce the problem (21 bytes, text/plain)
2014-04-19 09:55 UTC, Milan Bouchet-Valat

Description Milan Bouchet-Valat 2014-04-19 09:55:56 UTC
Created attachment 1588
Russian UTF-8 text to reproduce the problem

Reading characters which cannot be represented in the locale encoding on Windows fails when readLines() is passed an open connection. This is visible when reading the attached Russian UTF-8 text file on a Windows 2008 machine in the French CP1252 locale:

# Connection is not open when passed to readLines()
> readLines(file("W:/Bureau/russian_test.txt", encoding="UTF-8"))
[1] "Проверка"

# Connection is open due to open="r"
> readLines(file("W:/Bureau/russian_test.txt", encoding="UTF-8", open="r"))
[1] "?"
Warnings:
1: In readLines("W:/Bureau/russian_test.txt") :
  invalid input found in input connection 'W:/Bureau/russian_test.txt'
[approx translation from French]

The most important consequence is that readLines() calls file(..., open="r") in its R code before calling .Internal(), so a file name passed to readLines() always reaches the C code as an open connection. By default, therefore, setting the input encoding as an option and calling readLines() does not work for characters which cannot be represented in the locale encoding:
> options(encoding="UTF-8")
> readLines("W:/Bureau/russian_test.txt")
[1] "?"
Warnings:
1: In readLines("W:/Bureau/russian_test.txt") :
  invalid input found in input connection 'W:/Bureau/russian_test.txt'
[approx translation from French]


As spotted by Kurt Hornik, this happens because the C code for readLines() and scan() sets con->UTF8out=TRUE only when con is not yet open.
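
Given that explanation, one possible workaround is to hand readLines() a connection that is still closed, as in the first example above. The following is only a minimal sketch: read_utf8_lines() is a hypothetical helper name (not part of R), and the temporary file is a stand-in for the attached text, built from raw bytes so that the script does not depend on the session locale.

```r
# Hypothetical helper (not part of R): read a UTF-8 file through a connection
# that is still closed when readLines() receives it.
read_utf8_lines <- function(path) {
  con <- file(path, encoding = "UTF-8")  # no open= argument, so not opened here
  on.exit(close(con))                    # destroy the connection when done
  readLines(con)                         # readLines() opens and closes it itself
}

# Usage on a temporary file built from raw UTF-8 bytes (stand-in for the attachment).
expected <- "\u041f\u0440\u043e\u0432\u0435\u0440\u043a\u0430"  # the Russian word from the report
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw(enc2utf8(expected)), as.raw(0x0a)), tmp)
result <- read_utf8_lines(tmp)
print(identical(enc2utf8(result), enc2utf8(expected)))
unlink(tmp)
```

This only sidesteps the problem for callers who can build the connection themselves; it does not help code that passes a file name directly to readLines().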


> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
Comment 1 Milan Bouchet-Valat 2014-04-21 19:33:25 UTC
Actually, the corruption happens with any UTF-8 text file, regardless of whether the characters can be represented in the locale encoding. The code shown above therefore reproduces the problem with any UTF-8 text.
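
For convenience, here is a self-contained sketch that compares both ways of reading for two sample strings, one outside CP1252 (Cyrillic) and one inside it (accented Latin). The sample strings, the temporary files and the read_both() helper are illustrative assumptions, not taken from the attachment; the result depends on the R version and locale, so no particular output is claimed.

```r
# Two UTF-8 samples: Cyrillic (not representable in CP1252) and accented
# Latin (representable in CP1252). Both are illustrative stand-ins.
samples <- c(cyrillic = "\u041f\u0440\u043e\u0432\u0435\u0440\u043a\u0430",
             latin    = "caf\u00e9")

# Hypothetical helper: write the text as raw UTF-8 bytes, then read it back
# through a closed connection and through an already open connection.
read_both <- function(text) {
  path <- tempfile(fileext = ".txt")
  on.exit(unlink(path))
  writeBin(c(charToRaw(enc2utf8(text)), as.raw(0x0a)), path)

  closed_con <- file(path, encoding = "UTF-8")            # not open yet
  unopened <- readLines(closed_con)
  close(closed_con)

  open_con <- file(path, encoding = "UTF-8", open = "r")  # already open
  opened <- suppressWarnings(readLines(open_con))
  close(open_con)

  # TRUE means the text survived the round trip unchanged.
  c(unopened_ok = identical(enc2utf8(unopened), enc2utf8(text)),
    opened_ok   = identical(enc2utf8(opened), enc2utf8(text)))
}

results <- sapply(samples, read_both)
print(results)
```

On an affected setup, the opened_ok row would be expected to show FALSE for both samples while unopened_ok stays TRUE.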