Bug 16283 - read.csv Incorrectly Assigns row.names when there is 1 too many columns
Status: REOPENED
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.0.2
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 minor
Assignee: R-core
Reported: 2015-03-26 18:10 UTC by Bill Denney
Modified: 2017-01-18 03:56 UTC
Description Bill Denney 2015-03-26 18:10:37 UTC
I have a .csv file with one more column than column headers.  When I loaded the file with row.names=NULL, it gave a first column called "row.names" unexpectedly.

To replicate:

#Example 1 works as expected
> mytext <- textConnection(c('"a","b","c"', '1,2,3,4'))
> read.csv(mytext)
  a b c
1 2 3 4

#Example 2 does not work as expected
> mytext <- textConnection(c('"a","b","c"', '1,2,3,4'))
> read.csv(mytext, row.names=NULL)
  row.names a b c
1         1 2 3 4

#Example 3 gives an error as expected
> mytext <- textConnection(c('"a","b","c"', '1,2,3,4,5'))
> read.csv(mytext, row.names=NULL)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  more columns than column names
Comment 1 Peter Dalgaard 2015-03-26 18:25:06 UTC
As documented!

    If ‘row.names’ is not specified and the header line has one less
     entry than the number of columns, the first column is taken to be
     the row names.  This allows data frames to be read in from the
     format in which they are printed.  If ‘row.names’ is specified and
     does not refer to the first column, that column is discarded from
     such files.
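
The round trip that this help text describes can be sketched as follows. The data frame and the temporary file are made up for illustration; the sketch relies on write.table() with its default settings writing the row names as an unlabelled first column, so the header has one entry fewer than the data lines.

```r
# Hypothetical data frame with non-default row names.
df <- data.frame(a = 1:2, b = c(3.5, 4.5), row.names = c("x", "y"))

# Default write.table() output: the header line names only a and b,
# while each data line also carries the row name in front.
tmp <- tempfile(fileext = ".txt")
write.table(df, tmp)

# row.names is not specified, so the first column is taken as row names.
back <- read.table(tmp)
print(back)
unlink(tmp)
```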
Comment 2 Bill Denney 2015-03-26 19:11:01 UTC
(In reply to Peter Dalgaard from comment #1)
> As documented!
> 
>     If ‘row.names’ is not specified and the header line has one less
>      entry than the number of columns, the first column is taken to be
>      the row names.  This allows data frames to be read in from the
>      format in which they are printed.  If ‘row.names’ is specified and
>      does not refer to the first column, that column is discarded from
>      such files.

I think that the documentation is covering my example 1 (slightly changed so that the unusual row numbering is more obvious):

> mytext <- textConnection(c('"a","b","c"', '5,2,3,4'))
> print(mydata <- read.csv(mytext))
  a b c
5 2 3 4
> mydata$row.names
NULL

A bit of an expansion on what I view to be an error in example 2:

> mytext <- textConnection(c('"a","b","c"', '5,2,3,4'))
> print(mydata <- read.csv(mytext, row.names=NULL))
  row.names a b c
1         5 2 3 4
> mydata$row.names
[1] "5"

That seems different than the documentation.  I don't see a reason that a column called "row.names" should be created.
Comment 3 Peter Dalgaard 2015-03-26 20:30:36 UTC
It is at least not contrary to documentation. What actually happens is that 

    rlabp <- (cols - col1) == 1L
...
    if (rlabp) 
        col.names <- c("row.names", col.names)
...
    if (missing(row.names)) {
        if (rlabp) {
            row.names <- data[[1L]]
            data <- data[-1L]
            keep <- keep[-1L]
            compactRN <- FALSE
        }
        else row.names <- .set_row_names(as.integer(nlines))
    }
    else if (is.null(row.names)) {
        row.names <- .set_row_names(as.integer(nlines))
    }
 

so if you give data in the format that implies that there are row names in the first column AND set row.names=NULL, implying rownames 1:nlines, you get a little bit of both. 

I.e. there are two mechanisms for setting rownames, but no documentation for what happens if you use both at once. What actually happens does not seem unreasonable to me.
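
A minimal sketch of the two mechanisms side by side, using the same two-line input as the report (passed through the text= argument instead of a textConnection, which is only a convenience). The results noted in the comments are the ones shown in the transcripts of this report; other R versions have not been checked here.

```r
csv <- c('"a","b","c"', '5,2,3,4')

# Mechanism 1: row.names is missing and the header is one entry short,
# so the first column is removed from the data and used as row names.
d1 <- read.csv(text = csv)
names(d1)      # a, b, c
rownames(d1)   # "5"

# Mechanism 2 on top of the short header: "row.names" has already been
# prepended to the column names, and row.names = NULL then asks for
# automatic row numbering, so the first column stays in the data.
d2 <- read.csv(text = csv, row.names = NULL)
names(d2)      # row.names, a, b, c
rownames(d2)   # "1"
```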
Comment 4 Bill Denney 2015-06-23 12:39:41 UTC
Sorry for the long delay in response.

If this is viewed as reasonable, can the documentation be updated to specify that this is what happens in this (unusual) instance?

Since the example suggests that the user knows the column names, adding a new column name of "row.names" seems like it would be a surprise to most users.
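
For a user who does know the column names, one possible workaround (a sketch, not something proposed by R-core in this thread) is to skip the header line and supply the names directly. The name "extra" for the fourth column is an assumption made for illustration. With header = FALSE the short-header rule does not apply, so no "row.names" column is added.

```r
csv <- c('"a","b","c"', '5,2,3,4')

# Skip the short header and name all four columns explicitly.
# "extra" is a hypothetical name for the unlabelled fourth column.
d <- read.csv(text = csv, header = FALSE, skip = 1,
              col.names = c("a", "b", "c", "extra"))
print(d)
```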
Comment 5 loligo@sohu.com 2017-01-18 03:56:05 UTC
I encountered the same problem and I agree with Bill.  Table files with ragged lines are very common when the separator is "\t" (tab) or " " (white space), even more common than with CSV, since those separators are invisible.  For now I have two choices: 1) manually remove the redundant tabs or white space; 2) read the table file with header=F and stringsAsFactors=F, then parse the resulting data.frame, which contains character data in every field, e.g., by extracting the first line as the header and converting the other lines to numeric.
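
A variation on choice 2 that avoids converting everything from character is to read the header line on its own and let read.table() parse the remaining lines. This is only a sketch: read_ragged() and the "extra" column names are hypothetical, and it assumes the body never has fewer columns than the header.

```r
# Hypothetical helper: parse the header separately from the body so that
# a header one entry short is never treated as a sign of row names.
read_ragged <- function(lines, sep = "\t") {
  header <- scan(text = lines[1], what = "", sep = sep, quiet = TRUE)
  body <- read.table(text = lines[-1], sep = sep, header = FALSE,
                     stringsAsFactors = FALSE)
  extra <- ncol(body) - length(header)
  stopifnot(extra >= 0)  # assumption: body is at least as wide as the header
  # Unlabelled trailing columns get placeholder names extra1, extra2, ...
  names(body) <- c(header, if (extra > 0) paste0("extra", seq_len(extra)))
  body
}

# Each data line ends with a redundant tab, giving one extra empty field.
tab_lines <- c("a\tb\tc", "1\t2\t3\t", "4\t5\t6\t")
df <- read_ragged(tab_lines)
print(df)
```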

I think it would be very convenient if the row.names argument of read.table() accepted a value, such as NULL, a single character scalar, or a character vector, that lets the user indicate that, although the header line has one less entry than the number of columns in the second line, the first column is NOT the row names.  In that case, since the first column is NOT the row names, read.table() should respect fill=T and pad the following lines, with the padding cells getting the value NA.  This is already the behaviour of read.table() when the first line has one MORE field (an empty field created by a redundant tab) than the following lines.

In addition, the documentation is misleading, if it is not wrong.