Bug 16738

Summary: file.show() cannot handle null bytes (in UTF-16)
Product: R
Reporter: Mikko Korpela <mvkorpel>
Component: I/O
Assignee: R-core <R-core>
Status: UNCONFIRMED
Severity: enhancement    
Priority: P5    
Version: R 3.2.3   
Hardware: Other   
OS: Other   
Attachments: Suggested patch

Description Mikko Korpela 2016-02-29 20:20:42 UTC
Created attachment 2031: Suggested patch

This bug report is a follow-up to a thread on the R-devel mailing list: https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html

When a text file contains null bytes, e.g. a UTF-16 file whose code points are all ASCII, file.show() may fail to show it correctly.

As an example, the (system-dependent) result of the following code is a pager showing "<66>" (quotes not included) followed by several empty lines.

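  # ASCII text with CRLF line endings, as raw bytes
  foobar <- charToRaw("foo\r\nbar\r\n")
  # Little-endian BOM (ff fe), then each ASCII byte followed by a null byte
  foobar_utf16 <- c(as.raw(c(0xff, 0xfe)), rbind(foobar, as.raw(0L)))
  filename <- tempfile()
  writeBin(foobar_utf16, filename)
  file.show(filename, encoding="UTF-16")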
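
To make the null bytes in the test file visible, the byte vector can be inspected directly. This is only an illustrative sketch; the names 'utf16' and 'null_count' are not part of the report or the patch.

```r
# Build the same bytes as in the example: BOM, then ASCII bytes
# interleaved with null bytes (UTF-16, little-endian).
foobar <- charToRaw("foo\r\nbar\r\n")
utf16 <- c(as.raw(c(0xff, 0xfe)), rbind(foobar, as.raw(0L)))

# Count the null bytes: one for every ASCII byte in the original text.
null_count <- sum(utf16 == as.raw(0L))

# Print the raw vector to see the BOM and the alternating null bytes.
print(utf16)
```

Every second byte after the byte order mark is a null byte, which is what text-oriented reading of the file trips over.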

This was tested on a Linux computer running "R version 3.2.4 beta (2016-02-29 r70247)" and R-devel r70247.

With the suggested patch applied, the result is as expected: a pager showing the lines "foo" and "bar" (followed by an empty line). The following is an almost verbatim copy of what I wrote earlier on the mailing list, describing the patch.

The idea is to read the input file as raw bytes ("raw") in order to avoid problems with null bytes. After iconv(), the input then needs to be split into lines; alternatively, it could be written to the output file with cat() if the style of line termination characters does not matter. The 'perl = TRUE' argument is there only for an assumed performance advantage. It can be removed, or one might want to test whether it makes a significant difference one way or the other.
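
The steps described above can be sketched roughly as follows. This is not the patch itself, only a minimal illustration of the approach under stated assumptions: the variable names are hypothetical, the target encoding is fixed to UTF-8 for simplicity, and the handling of the byte order mark by iconv() is assumed to behave as it does on the Linux system used for testing.

```r
# Hypothetical stand-alone illustration; not the code from the attachment.
foobar <- charToRaw("foo\r\nbar\r\n")
utf16_bytes <- c(as.raw(c(0xff, 0xfe)), rbind(foobar, as.raw(0L)))
path <- tempfile()
writeBin(utf16_bytes, path)

# 1. Read the file as raw bytes, so that null bytes never end up
#    inside a character string before conversion.
bytes <- readBin(path, what = "raw", n = file.size(path))

# 2. Convert the encoding; iconv() accepts a list of raw vectors.
text <- iconv(list(bytes), from = "UTF-16", to = "UTF-8")

# 3. Split the converted text into lines on CRLF, LF or CR.
lines <- strsplit(text, "\r\n|\n|\r", perl = TRUE)[[1L]]

# 4. Write the lines to a temporary file that a pager could display.
out <- tempfile()
writeLines(lines, out)
unlink(path)
```

If the style of line terminators does not matter, steps 3 and 4 could be replaced by a single cat(text, file = out), as mentioned above.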

> sessionInfo()
R Under development (unstable) (2016-02-29 r70247)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base