Bug 15772 - WISHLIST: modify behavior of (C entry point) nrows to handle data.frames
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist
Version: R-devel (trunk)
Hardware: All All
Importance: P5 enhancement
Assignee: R-core
Reported: 2014-04-22 23:56 UTC by Kevin Ushey
Modified: 2014-04-25 06:54 UTC (History)

Description Kevin Ushey 2014-04-22 23:56:16 UTC
This was discussed on R-devel: https://mail.google.com/mail/u/0/#search/R-devel+row.names/1451a88b84e7b6d0

There are a few things that could be addressed here:

1. `nrows` is part of the C API (i.e., it's in Rinternals.h), but AFAICS it is not documented. I think it deserves a mention in R-exts or R-ints. Of course, if it is left undocumented because "it's exported, but you shouldn't use it unless you're willing to read the source code", that's okay I suppose, although its use is illustrated in R-exts 6.1.1 (http://cran.r-project.org/doc/manuals/R-exts.html#Transient-storage-allocation).

2. `nrows` gives misleading results for things that aren't arrays. IIUC, it simply returns the length of ostensibly 1-dimensional objects, leading to things like `nrows(df)` returning the number of 'columns' for a data.frame `df`.

3. There is no 'good' routine in the C API for getting the number of rows of a data.frame. IIUC, the number of rows is, by definition, given by its 'row.names' attribute. However, this attribute can be stored in compact form, and will be expanded automatically on a `getAttrib` call -- a wasteful operation when we just want the number of rows of a `data.frame`. Checking the length of the first column, or of any one column, is not a good alternative either, since a 'corrupt' data.frame may not have columns of equal length.

I propose the following function (modulo my mistakes):

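int df_nrows(SEXP s) {
    if (!inherits(s, "data.frame")) error("expecting a data.frame");
    // getAttrib0 is internal to R (attrib.c); it returns row.names as stored, without expanding it
    SEXP t = getAttrib0(s, R_RowNamesSymbol);
    // check for compact form, c(NA, n) or c(NA, -n); test the length before reading element 0
    if (isInteger(t) && LENGTH(t) == 2 && INTEGER(t)[0] == NA_INTEGER)
      return abs(INTEGER(t)[1]);
    else
      return LENGTH(t);
}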

or, alternatively, it could be inlined into the `nrows` function and dispatch in that way for `data.frame`s. Of course, someone out there might be depending on `nrows` returning the number of columns for their `data.frame`...

A similar argument could be made for `ncols`.
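
The compact-form check in the proposed function can be tried without R headers. What follows is a minimal stand-alone sketch, not R's implementation: `sketch_row_count` and `SKETCH_NA_INTEGER` are hypothetical names, the stored row.names attribute is modelled as a plain `int` array, and the sketch assumes that R represents `NA_integer_` as `INT_MIN`.

```c
#include <limits.h>
#include <stddef.h>

/* Assumption: R represents NA_integer_ as INT_MIN. It is mirrored here so
   that the sketch compiles without R headers. */
#define SKETCH_NA_INTEGER INT_MIN

/* Hypothetical helper: row count implied by a stored integer row.names
   vector `rn` of length `len`.
   Compact form is {NA, n} or {NA, -n}; any other vector has one entry
   per row, so its length is the row count. */
static size_t sketch_row_count(const int *rn, size_t len)
{
    /* Test the length before reading rn[0], so an empty vector is safe. */
    if (len == 2 && rn[0] == SKETCH_NA_INTEGER) {
        int n = rn[1];
        /* Widen before negating so that INT_MIN cannot overflow. */
        return (size_t)(n < 0 ? -(long long)n : n);
    }
    return len;
}
```

In this model, a stored vector of `{SKETCH_NA_INTEGER, -5}` yields 5 rows without anything being expanded, while an ordinary two-element vector such as `{7, 9}` yields 2.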
Comment 1 Brian Ripley 2014-04-23 08:32:59 UTC
This is not about the R function nrows (pace the original subject) but the C entry point.

C entry point nrows is not part of the C API (read the description in Writing R Extensions more carefully).
Comment 2 Kevin Ushey 2014-04-25 06:54:54 UTC
Thanks for the correction, Professor Ripley. This is of course made very clear in R-exts, so I apologise.

I still maintain my request that some API function be made available for computing the number of rows of a data.frame, since the 'best' way of doing so is not obvious. I think many authors of C code will either compute the length of one of the columns (not guaranteed to be correct, e.g. for 0-column data.frames or, more rarely, if an invalid data.frame with columns of different lengths is somehow generated), or check e.g. 'length(getAttrib(x, R_RowNamesSymbol))', which could be needlessly expensive because a compact-form row.names attribute is expanded.

Thanks for taking the time to consider my proposal.