Bug 14992 - row names of model.matrix
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist
Version: R 2.15.x
Hardware: All All
Importance: P5 enhancement
Assignee: R-core
Reported: 2012-07-16 10:28 UTC by Sebastian Meyer
Modified: 2017-07-08 21:01 UTC

Description Sebastian Meyer 2012-07-16 10:28:28 UTC
Dear R core,

I just observed that model.matrix is inconsistent in returning row.names.
Here is an example (taken from the help page):

ff <- log(Volume) ~ log(Height) + log(Girth)
utils::str(m <- model.frame(ff, trees))
mat <- model.matrix(ff, m)

Here, model.matrix derives the row names from m. This is also true for, e.g.,

model.matrix(~ log(Height), m)

but rownames() are NULL in the cases

model.matrix(~ 1, m)
or
model.matrix(~ 0, m)

As far as I can see, it is currently not specified or documented which row names the model.matrix should have. I would suggest always deriving the row names from the corresponding model.frame.
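
The suggestion can be checked with a small sketch along the following lines. The helper name has_frame_rownames is hypothetical and not part of R. The results for ~ 1 and ~ 0 depend on the R version in use, so no output is claimed for them here.

```r
# Hypothetical helper (not part of R): does the model matrix carry the
# row names of the model frame it was built from?
has_frame_rownames <- function(formula, mf) {
  identical(rownames(model.matrix(formula, mf)), row.names(mf))
}

ff <- log(Volume) ~ log(Height) + log(Girth)
m <- model.frame(ff, trees)

# Formulae to compare; the results for ~ 1 and ~ 0 vary across R versions.
formulae <- list(ff, ~ log(Height), ~ 1, ~ 0)
res <- vapply(formulae, has_frame_rownames, logical(1), mf = m)
names(res) <- vapply(formulae,
                     function(f) paste(deparse(f), collapse = " "),
                     character(1))
print(res)
```

A FALSE entry in the printed vector marks a formula for which model.matrix did not take over the row names of m.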

Best regards,
   Sebastian Meyer



R Version:
platform = x86_64-unknown-linux-gnu
arch = x86_64
os = linux-gnu
system = x86_64, linux-gnu
status = Under development (unstable)
major = 2
minor = 16.0
year = 2012
month = 07
day = 11
svn rev = 59772
language = R
version.string = R Under development (unstable) (2012-07-11 r59772)
nickname = Unsuffered Consequences
Comment 1 Martin Maechler 2012-07-20 14:56:32 UTC
Thank you for the clear report.

Though I don't see how the current behavior can become a problem,
I found it pretty easy to fix, and I do agree that consistency is desirable.
Comment 2 Brian Ripley 2012-07-21 06:41:48 UTC
Actually, this change is a problem.  It changes the output in several packages, including betareg and gstat.
Comment 3 Sebastian Meyer 2017-06-20 14:22:04 UTC
Now, 5 years later, I again stumbled upon this issue, finding that the fix in r59911 actually does not do what the NEWS (of R 2.15.2) promises:

"model.matrix(~1, ...) now also contains the same rownames that less trivial formulae produce. (Wish of PR#14992, changes the output of several packages.)"

In fact, the result of model.matrix.default(~ 1, ...) simply gets automatic row.names instead of the ones from the underlying model.frame.
Here is an example:

# set some row.names to see what happens
row.names(trees) <- 42 + seq_len(nrow(trees))
ff <- log(Volume) ~ log(Height) + log(Girth)
m <- model.frame(ff, trees)
model.matrix(~ log(Height), m)
model.matrix(~ 1, m)

To fix this,

> data <- data.frame(x=rep(0, nrow(data)))
in model.matrix.default would have to be modified to retain the original row.names, maybe by adding the argument row.names = row.names(data) or by switching to something like

> data[["x"]] <- rep.int(0, nrow(data))

A change might again affect the output of some packages...
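
Both variants can be tried outside of model.matrix.default on a data frame with rows but no columns, which is what is left of the model frame for ~ 1. This is only a sketch of the idea, not the code that was eventually committed; it works on a copy of trees, and the object names tr, data0, current, via_arg and via_assign are made up for illustration.

```r
# Give a copy of trees non-default row names, as in the example above.
tr <- trees
row.names(tr) <- 42 + seq_len(nrow(tr))
m <- model.frame(log(Volume) ~ log(Height) + log(Girth), tr)

# For ~ 1 no variables are left: a data frame with rows but no columns.
data0 <- m[, integer(0), drop = FALSE]

# The line quoted above: a fresh data frame, which gets automatic row names.
current <- data.frame(x = rep(0, nrow(data0)))

# Variant 1: pass the original row names explicitly.
via_arg <- data.frame(x = rep(0, nrow(data0)), row.names = row.names(data0))

# Variant 2: add the dummy column to the existing data frame.
via_assign <- data0
via_assign[["x"]] <- rep.int(0, nrow(data0))

identical(row.names(current), row.names(m))     # FALSE
identical(row.names(via_arg), row.names(m))     # TRUE
identical(row.names(via_assign), row.names(m))  # TRUE
```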
Comment 4 Martin Maechler 2017-07-08 19:51:30 UTC
You are right in your assessment.

At the useR! meeting, or actually the day before at the (closed group) DSC,
Luke Tierney gave a talk mentioning that row names construction in model.matrix() indeed takes 90% of the time of an lm() call in the case of large n (n=1e7 IIRC), and he was thinking aloud, or rather asserting, that the row names are never used there anyway. So we may eventually reconsider the API of model.matrix().
OTOH, I think there may be too much code using model.matrix() "stand-alone", i.e., not in conjunction with model.frame.

In any case, I'm having a look at this.
Thank you for reopening the issue.
Comment 5 Martin Maechler 2017-07-08 21:01:20 UTC
If package outputs change, as they may well,
I think it is fair to say they should update the output.

Committed as svn c 72903 (to R-devel only)