Bug 15245 - Critical problem in the basic read.table() function.
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: I/O
Version: R 2.14.2
Hardware: x86_64/x64/amd64 (64-bit) Linux
Importance: P5 critical
Assigned To: R-core
Reported: 2013-03-26 21:22 UTC by AlexR
Modified: 2013-04-07 21:00 UTC (History)
Attachments
An example table that causes an erroneous interpretation with read.table(). (812 bytes, text/plain)
2013-03-26 21:24 UTC, AlexR
Description AlexR 2013-03-26 21:22:22 UTC
Hello,

It seems there is a critical problem in the basic read.table() function in R. When reading a rather simple table (see the attachment; the command is read.table("1.xyz")) that does not contain anything that might cause incompatibility, the function returns a table with additional duplicated rows. It does not show any error messages or warnings whatsoever. In the attached example, the file contains 18 rows. However, the read-in version has 22 rows, with lines 6, 7, 8 and 9 of the file additionally repeated at the beginning. This can potentially cause lots of problems in major workflows. I have reproduced this problem in both of my R installations (2.14.2 and 2.10.1) running under Ubuntu Linux 10.04 LTS.

Thanks,
Alex
Comment 1 AlexR 2013-03-26 21:24:05 UTC
Created attachment 1423
An example table that causes an erroneous interpretation with read.table().
Comment 2 Duncan Murdoch 2013-03-26 21:38:39 UTC
I'm not sure what a correct interpretation of that file should be, so I'm not convinced it's a bug.  The file contains single quote characters, and by default those are treated as quotes for strings (see the "quote" argument).

If you use quote="" as an argument to tell it that the quotes are not to be interpreted as quotes, then it is read the way you'd expect.

I think an argument could be made that a warning should be issued on a file that makes such strange use of quotes as this one does (who would ever intend those quotes to be quoting strings containing newlines?), but I'm not sure there's a good solution to read this file by default.
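
A minimal sketch of the quote="" workaround described above. It uses three rows copied from the reporter's file and passes them inline through the text argument, an assumption made only to keep the example self-contained (the original report reads "1.xyz" from disk):

```r
# Three rows copied from the attached file; the atom names contain ' and ''.
rows <- c(
  " HO5'  H   -2.563739    1.873349   -0.485452",
  "  O5'  O   -2.054881    1.175528   -0.064494",
  " H5''  H   -0.198745    2.105679   -0.357748"
)

# quote = "" disables quote processing, so ' is treated as an ordinary character.
xyz <- read.table(text = rows, quote = "", stringsAsFactors = FALSE)

print(nrow(xyz))  # one row per input line
print(xyz$V1)     # identifiers keep their primes
```

For files whose identifiers may also contain "#", setting comment.char = "" in the same call would presumably be needed as well.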
Comment 3 Peter Dalgaard 2013-03-27 10:48:16 UTC
This is user error. Newlines between quotes are part of the string, and this is by design and as documented:

     Quoted fields with embedded newlines are supported except after a
     comment character.

Multiline strings do occur, e.g., when exporting data from spreadsheets.
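
For illustration, a small sketch of the documented multiline behaviour, assuming a version of R in which read.table() and scan() agree on quote handling. The data and the column names id and note are made up for this example:

```r
# Made-up two-column data: the first record's note spans two physical lines.
txt <- "id note\n1 'first line\nsecond line'\n2 'single line'\n"

# Default quote setting: ' and " both delimit strings, and a newline
# between a pair of quotes becomes part of the string.
d <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

print(nrow(d))    # counts records, not physical lines
print(d$note[1])  # the embedded newline is kept inside the string
```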
Comment 4 AlexR 2013-03-30 22:17:15 UTC
I am sorry for not stating the case clearly. The content of the attached example file is the following:

 HO5'  H   -2.563739    1.873349   -0.485452
  O5'  O   -2.054881    1.175528   -0.064494
  C5'  C   -0.736414    1.182238   -0.613055
  H5'  H   -0.746711    1.081229   -1.704534
 H5''  H   -0.198745    2.105679   -0.357748
  C4'  C    0.000000    0.000000    0.000000
  H4'  H   -0.476764   -0.933574   -0.327309
  O4'  O    1.355460    0.043923   -0.435761
  C1'  C    2.233321   -0.325655    0.628134
  H1'  H    2.449028   -1.402756    0.605694
 H1''  H    3.166198    0.224176    0.497153
  C2'  C    1.503592    0.000000    1.945146
  H2'  H    1.775164    1.009485    2.269603
 H2''  H    1.740207   -0.683620    2.766584
  C3'  C    0.000000    0.000000    1.542778
  H3'  H   -0.522675    0.870464    1.943772
  O3'  O   -0.580674   -1.195428    2.072863
 HO3'  H   -1.536340   -1.091244    2.078183

It is intended to be converted into a table as it looks, with 5 columns and 18 rows. Entries like H1'' are intended to be read as "H1''"; such prime and double-prime notations are quite common in atom identifiers for molecular modelling. However, the read.table() function returns the following:

> read.table("1.xyz")
     V1 V2        V3        V4        V5
1   C4'  C  0.000000  0.000000  0.000000
2   H4'  H -0.476764 -0.933574 -0.327309
3   O4'  O  1.355460  0.043923 -0.435761
4   C1'  C  2.233321 -0.325655  0.628134
5  HO5'  H -2.563739  1.873349 -0.485452
6   O5'  O -2.054881  1.175528 -0.064494
7   C5'  C -0.736414  1.182238 -0.613055
8   H5'  H -0.746711  1.081229 -1.704534
9  H5''  H -0.198745  2.105679 -0.357748
10  C4'  C  0.000000  0.000000  0.000000
11  H4'  H -0.476764 -0.933574 -0.327309
12  O4'  O  1.355460  0.043923 -0.435761
13  C1'  C  2.233321 -0.325655  0.628134
14  H1'  H  2.449028 -1.402756  0.605694
15 H1''  H  3.166198  0.224176  0.497153
16  C2'  C  1.503592  0.000000  1.945146
17  H2'  H  1.775164  1.009485  2.269603
18 H2''  H  1.740207 -0.683620  2.766584
19  C3'  C  0.000000  0.000000  1.542778
20  H3'  H -0.522675  0.870464  1.943772
21  O3'  O -0.580674 -1.195428  2.072863
22 HO3'  H -1.536340 -1.091244  2.078183

The read-in version has 22 rows, with lines 6, 7, 8 and 9 of the file additionally repeated at the beginning. Please note that the atom identifiers (V1 column) are still read correctly. There are just duplicated rows.

Thanks for the responses. Indeed, using quote="" (Comment 2) has solved the problem, but honestly, reading the documentation on quotes and nested quotes several times does not really clarify (for the user) how read.table() treats them.

Since the files created by my scripts looked fine at first glance, I only spotted this after wasting months of supercomputer time running quantum mechanical calculations on wrong files. Perhaps some warning or clarification in the documentation would help others avoid the same trap in future.
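
One way to guard against this kind of silent mismatch is to compare the number of rows read with the number of lines in the file. The following is only a sketch: read_table_checked() is a hypothetical helper, not part of R, and it assumes a headerless file with one record per line, no comments, and no quoted fields.

```r
# Hypothetical helper (not part of R): read a headerless, whitespace-separated
# table with quoting and comments disabled, and stop if the number of rows
# differs from the number of non-blank lines in the file.
read_table_checked <- function(path, ...) {
  raw <- readLines(path)
  expected <- sum(nzchar(trimws(raw)))
  tab <- read.table(path, quote = "", comment.char = "", ...)
  if (nrow(tab) != expected) {
    stop("read ", nrow(tab), " rows, but the file has ",
         expected, " non-blank lines")
  }
  tab
}

# Usage with two rows from the example file, written to a temporary file.
path <- tempfile(fileext = ".xyz")
writeLines(c("  C4'  C    0.000000    0.000000    0.000000",
             " H1''  H    3.166198    0.224176    0.497153"), path)
tab <- read_table_checked(path, stringsAsFactors = FALSE)
print(tab)
```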

Thanks a lot for the prompt attention,
Alex
Comment 5 AlexR 2013-03-30 22:26:20 UTC
Also, the attached table file is generated from R, so there are no hidden characters that might interfere with read.table(). And again, the individual entries in the first column are correctly represented by read.table(), there is just a duplication of several rows.
Comment 6 Peter Dalgaard 2013-03-31 00:48:56 UTC
As Duncan Murdoch also pointed out, there is at least one real bug in here. I actually thought his notes had made it to the bug repo.

It's still user error to overlook the quote argument and have quotes in the file, but the quotes seem to be interpreted inconsistently and this has created a rather extreme case of bad luck for the user.

Specifically, the internal readTableHead and scan() treat newlines between single quotes differently, but scan()'s generic treatment of single quotes also contains some surprises:


> scan(text="
+ 'a
+ '
+ b' ' c 'd  e'
+ ' f 
+ g '
+ ", what="")
Read 6 items
[1] "a\n"     "b'"      " c "     "d"       "e'"      " f \ng "
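
For comparison, a sketch of the same input read with quoting disabled. The input is rewritten here as a single string with \n escapes, and trailing spaces are dropped, which should not affect the result because fields are whitespace-separated:

```r
# Same input as the scan() call above, written as one string.
txt <- "\n'a\n'\nb' ' c 'd  e'\n' f\ng '\n"

# With quote = "" the apostrophes are ordinary characters and fields are
# split on whitespace only, so no item can span a newline.
items <- scan(text = txt, what = "", quote = "", quiet = TRUE)
print(items)
```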
Comment 7 Duncan Murdoch 2013-04-07 18:57:27 UTC
It appears the problem is mainly in the readtablehead function, which uses different rules for handling quotes than scan() does.  After some additional testing I'll commit a fix.
Comment 8 Duncan Murdoch 2013-04-07 21:00:04 UTC
Should be fixed now in R-devel and soon in R-patched.