Bug 15625 - read.table: Incorrect handling of character nuls over the first 5 lines
Status: CLOSED WONTFIX
Product: R
Classification: Unclassified
Component: Accuracy
Version: R 3.0.2
Hardware: ix86 (32-bit) Windows 32-bit
Importance: P5 normal
Assigned To: R-core
Reported: 2014-01-02 10:35 UTC by Matthieu Petiteville
Modified: 2014-01-28 14:34 UTC (History)
Attachments
Test csv file (63 bytes, application/vnd.ms-excel)
2014-01-02 10:35 UTC, Matthieu Petiteville

Description Matthieu Petiteville 2014-01-02 10:35:07 UTC
Created attachment 1546
Test csv file

Happy new year everyone,

I would like to report what looks to me like a bug in the read.table function, which in turn affects the read.csv function as well.

I have a csv file which unfortunately contains a column full of \x00 values.
The scan function seems to handle that case correctly, i.e. it returns an empty value for those fields. However, the C code that read.table uses to check the column layout over the first 5 lines seems to read them incorrectly and to interpret each \x00 as an end of line.

The result is that the first 5 lines (if there is no header) are cut off at the position of the \x00, while the following lines, which are read using scan directly, are not.

Here's a quick example:

My csv file: test1.csv

ColA,ColB,ColC
a,\x00,1
b,\x00,1
c,\x00,1
d,\x00,1
e,\x00,1

read.csv(test.csv) returns :

  ï..ColA ColB ColC
1       a   NA   NA
2       b   NA   NA
3       c   NA   NA
4       d   NA   NA
5       e   NA    1
6       f   NA    1
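
For anyone who wants to reproduce this without the attachment, here is a minimal sketch in R. It assumes the ColB fields are single raw NUL bytes (0x00), which is not confirmed at this point in the report. The temporary file and the helper name make_row are made up for illustration, and the BOM of the attached file is left out.

```r
# Sketch only: write a small CSV whose ColB fields are single NUL bytes (0x00).
# Assumption: the NULs are raw 0x00 bytes; make_row() is a hypothetical helper.
path <- tempfile(fileext = ".csv")

# R strings cannot hold a NUL, so each row is assembled as a raw vector.
make_row <- function(first) {
  c(charToRaw(paste0(first, ",")), as.raw(0), charToRaw(",1\n"))
}

# Header line followed by six data rows (a to f).
bytes <- c(charToRaw("ColA,ColB,ColC\n"),
           do.call(c, lapply(letters[1:6], make_row)))

# writeBin() writes the bytes unchanged, NULs included.
writeBin(bytes, path)
```

Calling read.csv(path) on the resulting file should then show how a given R version treats the embedded NULs.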


This is probably very dependent on the encoding of the \x00 value, which I unfortunately cannot guarantee.

I have already tried nearly every combination of quote, fileEncoding and separator settings.

I doubt that changing the na.strings parameter would change the behavior, since the C function C_readtablehead doesn't take na.strings as an argument.

Any help is welcome,

Best regards

Matthieu Petiteville
Comment 1 Brian Ripley 2014-01-07 11:49:07 UTC
I think the bug is that this works at all, even for scan().

An actual reproducible example would have helped: I presume you meant

read.csv("test1.csv")

and with that I do not get what you show.  Also, the file you attach and the listing you give are different.

AFAICS the issue is embedded nuls, and not "\x00" values.  Embedded nuls have not been supported in R for many years.  I have added a warning for this case.
Comment 2 Brian Ripley 2014-01-08 10:16:28 UTC
Two further comments:

1) this file has a BOM, so should have been read with

fileEncoding = "UTF-8-BOM"

unless in a UTF-8 locale on a platform which skips BOMs (e.g. OS X).

2) The 'skipNul' argument available in R-devel (3.1.0-to-be) will help in this case.
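
A minimal sketch of point 2, assuming R 3.1.0 or later. The temporary file is made up for illustration and has no BOM; a file with a BOM would additionally need the fileEncoding setting from point 1.

```r
# Sketch only: read a CSV with embedded NUL bytes using skipNul (R >= 3.1.0).
# Assumption: the file has no BOM; the temporary file is made up for illustration.
path <- tempfile(fileext = ".csv")
nul <- as.raw(0)

# Two data rows whose middle field is a single NUL byte.
bytes <- c(charToRaw("ColA,ColB,ColC\na,"), nul,
           charToRaw(",1\nb,"), nul,
           charToRaw(",1\n"))
writeBin(bytes, path)

# skipNul = TRUE drops the NUL bytes, so ColB is read as an empty (NA) field.
df <- read.csv(path, skipNul = TRUE)
print(df)
```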
Comment 3 Matthieu Petiteville 2014-01-28 14:34:10 UTC
Hello Brian

Thank you for looking into it.

I have tried fileEncoding = "UTF-8-BOM" and still get the issue, although the headers are now correct:

  ColA ColB ColC
1    a   NA   NA
2    b   NA   NA
3    c   NA   NA
4    d   NA   NA
5    e   NA    1
6    f   NA    1

Sorry for not being crystal clear, I was trying to present the problem with as much information as I could.
When I open the test1.csv attached to my first message in Notepad++ (NPP), the values in ColB appear as "NUL", and when I search for the \x00 value, the NPP search selects those "NUL" values, which is what led me to believe they were indeed \x00.
I am no expert in encoding so I might be completely wrong.
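
One way to check this from R itself is to look at the raw bytes of the file. The sketch below is only an illustration: inspect_bytes is a made-up helper, not part of R, and the demo file is a stand-in for test1.csv.

```r
# Sketch only: inspect a file's raw bytes for a UTF-8 BOM and for NUL bytes.
# inspect_bytes() is a hypothetical helper, not part of R.
inspect_bytes <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  list(
    has_utf8_bom  = length(bytes) >= 3 && identical(bytes[1:3], bom),
    nul_positions = which(bytes == as.raw(0))  # 1-based byte offsets of 0x00
  )
}

# Demo on a tiny made-up file: BOM, then "a", a NUL byte, "b".
demo <- tempfile()
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("a"), as.raw(0), charToRaw("b")), demo)
res <- inspect_bytes(demo)
str(res)
```

A non-empty nul_positions would mean the file really contains 0x00 bytes rather than the four characters "\x00".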

If you do not see this issue when using read.csv("test1.csv") on your side, I think it could be linked to my local setup as well.

Anyway, thank you for your time. I will try the skipNul argument in the next version to see whether it helps with my issue.

Best regards