Bug 16738 - file.show() cannot handle null bytes (in UTF-16)
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 3.2.3
Hardware: Other Other
Importance: P5 enhancement
Assignee: R-core
Reported: 2016-02-29 20:20 UTC by Mikko Korpela
Modified: 2016-02-29 20:20 UTC (History)

Attachments
Suggested patch (798 bytes, patch)
2016-02-29 20:20 UTC, Mikko Korpela

Description Mikko Korpela 2016-02-29 20:20:42 UTC
Created attachment 2031
Suggested patch

This bug report is a follow-up to a thread on the R-devel mailing list: https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html

When a text file contains null bytes, e.g. a UTF-16 file consisting of ASCII-range code points, file.show() may fail to display it correctly.

As an example, the (system-dependent) result of the following code is a pager showing "<66>" (without the quotes) followed by several empty lines.

  foobar <- charToRaw("foo\r\nbar\r\n")
  foobar_utf16 <- c(as.raw(c("0xff", "0xfe")), rbind(foobar, as.raw(0L)))
  filename <- tempfile()
  writeBin(foobar_utf16, filename)
  file.show(filename, encoding="UTF-16")

This was tested on a Linux computer running "R version 3.2.4 beta (2016-02-29 r70247)" and R-devel r70247.

With the suggested patch applied, the result is as expected: a pager showing the lines "foo" and "bar" (followed by an empty line). The following is an almost verbatim copy of what I wrote earlier on the mailing list, describing the patch.

The idea is to read the input file "raw" in order to avoid problems with null bytes. After iconv(), the input then needs to be split into lines; alternatively, it could be written to the output file with cat() if the style of the line termination characters does not matter. The 'perl = TRUE' argument is there only for an assumed performance advantage. It can be removed, or one might want to test whether it makes a significant difference one way or the other.
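
A minimal sketch of this idea follows. It is not the attached patch: the helper name read_lines_raw() is hypothetical, and the use of strsplit() with this particular line-ending pattern is an assumption made for illustration. The handling of the byte order mark by iconv() may also be system dependent.

```r
# Hypothetical helper, not the code from the attached patch.
# Reads a file as raw bytes, converts it to UTF-8, and splits it into lines.
read_lines_raw <- function(path, encoding) {
  # Reading as raw avoids any special treatment of null bytes
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # iconv() accepts a list of raw vectors; each element is converted as a unit
  text <- iconv(list(bytes), from = encoding, to = "UTF-8")
  # Split on CRLF, LF or CR line endings (assumed pattern)
  strsplit(text, "\r\n|\n|\r", perl = TRUE)[[1L]]
}

# Usage with the same kind of UTF-16 test file as in the example above
foobar <- charToRaw("foo\r\nbar\r\n")
foobar_utf16 <- c(as.raw(c(0xff, 0xfe)), rbind(foobar, as.raw(0L)))
filename <- tempfile()
writeBin(foobar_utf16, filename)
lines <- read_lines_raw(filename, "UTF-16")
print(lines)
```

The resulting character vector could then be passed to writeLines() to produce a temporary file for the pager. If the original line endings should be preserved instead, the converted text could be written with cat() without splitting.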

> sessionInfo()
R Under development (unstable) (2016-02-29 r70247)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base