Bugzilla – Bug 15245
Critical problem in the basic read.table() function.
Last modified: 2014-08-12 07:58:01 UTC
It seems there is a critical problem in the basic read.table() function in R. When reading a rather simple table that does not contain anything that might cause incompatibility (the file is attached; the command is read.table("1.xyz")), the function produces a table that contains additional duplicated lines. The function does not show any error messages or warnings whatsoever. In the attached example, the file contains 18 rows. However, the read-in version shows 22 rows, with lines 6, 7, 8 and 9 of the actual file additionally repeated at the beginning. This can potentially cause lots of problems in major workflows. I have reproduced this problem in both of my R installations (2.14.2 and 2.10.1) running under Ubuntu Linux 10.04 LTS.
Created attachment 1423
An example table that causes an erroneous interpretation with read.table().
I'm not sure what a correct interpretation of that file should be, so I'm not convinced it's a bug. The file contains single quote characters, and by default those are treated as quotes for strings (see the "quote" argument).
If you use quote="" as an argument to tell it that the quotes are not to be interpreted as quotes, then it is read the way you'd expect.
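For illustration, here is a minimal sketch of that workaround. It inlines the first five lines of the reported file through the `text` argument instead of reading "1.xyz" from disk; the variable names `xyz_text` and `atoms` are only placeholders, and `stringsAsFactors = FALSE` is added so the labels stay character strings in older R versions as well.

```r
# First five lines of the reported file, inlined so the sketch is self-contained.
xyz_text <- "HO5' H -2.563739 1.873349 -0.485452
O5' O -2.054881 1.175528 -0.064494
C5' C -0.736414 1.182238 -0.613055
H5' H -0.746711 1.081229 -1.704534
H5'' H -0.198745 2.105679 -0.357748"

# quote = "" switches off quote processing, so ' and '' remain part of the
# atom label instead of being treated as string delimiters.
atoms <- read.table(text = xyz_text, quote = "", stringsAsFactors = FALSE)

print(dim(atoms))   # number of rows and columns actually read
print(atoms$V1)     # atom labels, primes included
```

For the real file, the same call would presumably be read.table("1.xyz", quote = "").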
I think an argument could be made that a warning should be issued on a file that makes such strange use of quotes as this one does (who would ever intend those quotes to be quoting strings containing newlines?), but I'm not sure there's a good solution to read this file by default.
This is user error. Newlines between quotes are part of the string, and this is by design and as documented:
Quoted fields with embedded newlines are supported except after a
Multiline strings do occur, e.g., when exporting data from spreadsheets.
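As a hedged illustration of the intended use of that feature, the sketch below reads a small made-up CSV fragment (not taken from the attachment) in which one double-quoted field spans two lines, as a spreadsheet export might produce. The names `csv_text` and `notes` are placeholders.

```r
# A made-up spreadsheet-style export: the note for id 1 contains a newline.
csv_text <- 'id,note
1,"first line
second line"
2,plain'

# With the default quote handling the newline stays inside the field,
# so the three physical lines of data form two records.
notes <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(nrow(notes))
print(notes$note[1])
```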
I am sorry for not stating the case clearly. The content of the attached example file is the following:
HO5' H -2.563739 1.873349 -0.485452
O5' O -2.054881 1.175528 -0.064494
C5' C -0.736414 1.182238 -0.613055
H5' H -0.746711 1.081229 -1.704534
H5'' H -0.198745 2.105679 -0.357748
C4' C 0.000000 0.000000 0.000000
H4' H -0.476764 -0.933574 -0.327309
O4' O 1.355460 0.043923 -0.435761
C1' C 2.233321 -0.325655 0.628134
H1' H 2.449028 -1.402756 0.605694
H1'' H 3.166198 0.224176 0.497153
C2' C 1.503592 0.000000 1.945146
H2' H 1.775164 1.009485 2.269603
H2'' H 1.740207 -0.683620 2.766584
C3' C 0.000000 0.000000 1.542778
H3' H -0.522675 0.870464 1.943772
O3' O -0.580674 -1.195428 2.072863
HO3' H -1.536340 -1.091244 2.078183
and it is intended to be converted into a table as it looks, having 5 columns and 18 rows. Entries like H1'' are intended to be read as "H1''"; such prime and double-prime notations are quite common in atom identifiers for molecular modelling. However, the read.table() function returns the following:
V1 V2 V3 V4 V5
1 C4' C 0.000000 0.000000 0.000000
2 H4' H -0.476764 -0.933574 -0.327309
3 O4' O 1.355460 0.043923 -0.435761
4 C1' C 2.233321 -0.325655 0.628134
5 HO5' H -2.563739 1.873349 -0.485452
6 O5' O -2.054881 1.175528 -0.064494
7 C5' C -0.736414 1.182238 -0.613055
8 H5' H -0.746711 1.081229 -1.704534
9 H5'' H -0.198745 2.105679 -0.357748
10 C4' C 0.000000 0.000000 0.000000
11 H4' H -0.476764 -0.933574 -0.327309
12 O4' O 1.355460 0.043923 -0.435761
13 C1' C 2.233321 -0.325655 0.628134
14 H1' H 2.449028 -1.402756 0.605694
15 H1'' H 3.166198 0.224176 0.497153
16 C2' C 1.503592 0.000000 1.945146
17 H2' H 1.775164 1.009485 2.269603
18 H2'' H 1.740207 -0.683620 2.766584
19 C3' C 0.000000 0.000000 1.542778
20 H3' H -0.522675 0.870464 1.943772
21 O3' O -0.580674 -1.195428 2.072863
22 HO3' H -1.536340 -1.091244 2.078183
The read-in version shows 22 rows, with lines 6, 7, 8 and 9 of the actual file additionally repeated at the beginning. Please note that the atom identifiers (V1 column) are still read correctly. There are just duplicated rows.
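A silent mismatch like this one could be caught by comparing the number of rows read with the number of non-blank lines in the file. The sketch below is only one possible way to do that: `read_table_checked` is a hypothetical helper, not part of R, and it assumes a headerless, whitespace-separated file in which quotes and `#` carry no special meaning.

```r
# Hypothetical helper (not part of R): read a headerless whitespace-separated
# file literally and stop if the row count differs from the non-blank line count.
read_table_checked <- function(path) {
  lines <- readLines(path)
  expected <- sum(nzchar(trimws(lines)))
  tab <- read.table(path, quote = "", comment.char = "",
                    stringsAsFactors = FALSE)
  if (nrow(tab) != expected) {
    stop("read ", nrow(tab), " rows but the file has ",
         expected, " non-blank lines")
  }
  tab
}

# Usage with a temporary file holding three lines from the report.
tmp <- tempfile(fileext = ".xyz")
writeLines(c("H1'' H 3.166198 0.224176 0.497153",
             "C2' C 1.503592 0.000000 1.945146",
             "H2' H 1.775164 1.009485 2.269603"), tmp)
checked <- read_table_checked(tmp)
print(nrow(checked))
```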
Thanks for the responses. Indeed, using quote="" (Comment 2) has solved the situation, but honestly, reading the documentation about quotes and quotes within quotes several times does not really clarify (for the user) the way read.table() treats them.
Since the files required by my scripts were created and looked fine at first glance, I only spotted this after wasting months of supercomputer time running quantum mechanical calculations on wrong files. Perhaps a warning or a clarification in the documentation would help others avoid falling into the same trap in the future.
Thanks a lot for the prompt attention,
Also, the attached table file is generated from R, so there are no hidden characters that might interfere with read.table(). And again, the individual entries in the first column are correctly represented by read.table(), there is just a duplication of several rows.
As Duncan Murdoch also pointed out, there is at least one real bug in here. I actually thought his notes had made it to the bug repo.
It's still user error to overlook the quote argument and have quotes in the file, but the quotes seem to be interpreted inconsistently and this has created a rather extreme case of bad luck for the user.
Specifically, the internal readTableHead and scan() treat newlines between single quotes differently, but scan()'s generic treatment of single quotes also contains some surprises:
+ b' ' c 'd e'
+ ' f
+ g '
+ ", what="")
Read 6 items
 "a\n" "b'" " c " "d" "e'" " f \ng "
It appears the problem is mainly in the readtablehead function, which uses different rules for handling quotes than scan() does. After some additional testing I'll commit a fix.
Should be fixed now in R-devel and soon in R-patched.