Bug 14103 - read.csv confused by newline characters in header
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: I/O
Version: old
Hardware: ix86 (32-bit) Windows 32-bit
Importance: P5 normal
Assigned To: Jitterbug compatibility account
Reported: 2009-12-02 15:50 UTC by Jitterbug compatibility account
Modified: 2009-12-04 18:39 UTC

Description Jitterbug compatibility account 2009-12-02 15:50:22 UTC
From: g.russell@eos-solutions.com
Full_Name: George Russell
Version: 2.10.0
OS: Microsoft Windows XP Service Pack 2
Submission from: (NULL) (217.111.3.131)


The following code (typed into R --vanilla)

testString <- '"B1\nB2"\n1\n'
con <- textConnection(testString)
tab <- read.csv(con,stringsAsFactors = FALSE)

produces a data frame with one row and one column; the name of the column
is "B1.B2" (alright so far). However according to
print(tab[[1,1]])

the value of the entry in the first row and first column is

"B2\n1\n"

So B2 has somehow got into both the names of the data frame and its entry.
Either R is confused or I am. What is going on?

Comment 1 Jitterbug compatibility account 2009-12-02 18:55:40 UTC
From: Peter Dalgaard <P.Dalgaard@biostat.ku.dk>

Presumably, read.table is not obeying quotes when removing what it
thinks is the header line. Another variation is this:

> tab <- read.table(stdin(), head=T)
0: "B1
0: B2"
1: 1
2:
> tab
  B1.B2
1   B2"
2     1


It's somehow connected to the

pushBack(c(lines, lines), file)

bits in readtable.R, but I don't quite get it.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)              FAX: (+45) 35327907

Comment 2 Jitterbug compatibility account 2009-12-04 18:39:13 UTC
From: Prof Brian Ripley <ripley@stats.ox.ac.uk>
It's not to do with pushback per se.  That works as one might expect,
e.g.

f <- file("test.txt", "r")
pushBack('"A1\nA2"', f)
pushBackLength(f)
scan(f, "", quote='"')

gives "A1\nA2" on a single line, then whatever was in test.txt.
Rather, the issue is

         if (header) {
             readLines(file, 1L)          # skip over header

and that stops at the embedded newline.  The fix is to read the header 
again the same way as before.
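
The difference between the two ways of consuming the header can be seen in isolation. The following is a minimal sketch, not the code in readtable.R: it reads the reporter's test string once with readLines() and once with a quote-aware scan(). The variable names are made up for illustration.

```r
# Illustrative only: variable names are hypothetical, not taken from readtable.R.
testString <- '"B1\nB2"\n1\n'

# Line-based read: readLines() knows nothing about quotes, so it stops at the
# newline embedded in the quoted header field.
con <- textConnection(testString)
header_by_line <- readLines(con, 1L)
close(con)

# Quote-aware read: scan() treats the quoted field as a single item, with the
# embedded newline kept inside it.
con <- textConnection(testString)
header_by_scan <- scan(con, what = "", quote = '"', nmax = 1L, quiet = TRUE)
close(con)

print(header_by_line)
print(header_by_scan)
```

With the line-based skip, the remainder of the header (B2 and its closing quote) is left on the connection and is read as data afterwards, which would be consistent with the stray entries reported above.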

It seems to me that this is esoteric and the fix could tickle similar 
esoteric constructions, so I am only going to put the fix into R-devel 
and not the upcoming 2.10.1.


-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Comment 3 Jitterbug compatibility account 2009-12-04 18:41:00 UTC
NOTES:
 Fixed for 2.11.0
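
A quick way to check whether a given installation still shows the problem is to rerun the reporter's example and inspect the result. This is only a sketch under the assumption that a fixed version returns one column named B1.B2 with the single value 1; the helper name header_newline_ok is hypothetical.

```r
# Hypothetical helper: reruns the reporter's example on the running version of R.
header_newline_ok <- function() {
  con <- textConnection('"B1\nB2"\n1\n')
  on.exit(close(con))
  tab <- read.csv(con, stringsAsFactors = FALSE)
  # Assumed outcome once fixed: one column named B1.B2, one row, value 1.
  identical(names(tab), "B1.B2") &&
    nrow(tab) == 1L &&
    isTRUE(tab[[1, 1]] == 1)
}

print(header_newline_ok())
```

On an affected version the entry is the string reported in the description, so the comparison with 1 fails and the helper returns FALSE.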
Comment 4 Jitterbug compatibility account 2009-12-04 18:41:32 UTC
Audit (from Jitterbug):
Fri Dec  4 12:41:32 2009	ripley	changed notes
Fri Dec  4 12:41:32 2009	ripley	moved from incoming to In-Out-fixed