Bug 16158 - Error in predict.lm for rank-deficient cases
Summary: Error in predict.lm for rank-deficient cases
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: Models (show other bugs)
Version: R 3.1.2
Hardware: All All
: P5 normal
Assignee: R-core
Depends on:
Reported: 2015-01-14 22:20 UTC by Russ Lenth
Modified: 2015-02-17 11:49 UTC (History)
1 user (show)

See Also:


Note You need to log in before you can comment on or make changes to this bug.
Description Russ Lenth 2015-01-14 22:20:25 UTC
This is a followup to my previous wishlist report, #15072. It was submitted years ago, but there has been no response. Meanwhile, the bug I am reporting here persists.

The help page for predict.lm states that it is not possible to determine estimability of predictions for new data when the model matrix is rank-deficient. This is incorrect. The theory of estimability is well-established. See, for example, Chapter 3 of Monahan, John F. (2007), *A Primer on Linear Models*, CRC Press. In particular, Methods 3.2 and 3.3 on page 40 of that text show how estimability can be determined. 

Using such methods, predictions from a rank-deficient model need not be misleading. The estimability of each linear function can be checked within a suitable tolerance, and in those cases where the function is deemed non-estimable, the prediction can be flagged in some way, such as outputting it as `NA`. Perhaps you want to provide an option to not test estimability, but testing it should be the default when `newdata` is present. An implementation of Method 3.3 is offered in the **lsmeans** package, and you are welcome to adapt it to your needs.

I wonder if there is confusion on this issue based on the use of `NA` in the returned coefficients for predictors that are thrown out. In the example below, lm(y ~ x1+x2+x3+x4) returns coefficients of b = (b0,...,b4) = (5,3,0,NA,NA). Consider predicting at x = (1,2,-1,5). These settings have nonzero coefficients for b3 and b4, while b3 and b4 are NA. It may SEEM problematic to have nonzero coefficients where there are NAs, but it isn't -- because, really, the NAs are NOT missing; they denote positions where the solution is constrained to zero. The actual solution returned by lm is b = (5,3,0,0,0); and a prediction at x = ((1,2,-1,5) is definitely estimable: this is just the 6th row in the data matrix. 

Also, b is just one of infinitely many solutions. By entering variables in different orders, we can obtain other solutions to the normal equations such as c = c(-19,NA,NA,3,6) and d = (NA,4.25,-1.25,NA,1.25), where in each case the NAs are really just 0. The point is not where the NAs are, it is whether the predictions are the same for all solutions.

My lsmeans package has functions lsmeans::nonest.basis and lsmeans::is.estble that implement Monahan's Method 3.3. nonest.basis returns an orthonormal basis for the null row space of the data matrix. It uses the fact that if X = QR, the row space of R uis the same as the row space of X. So it obtains t(R), augments it so it is full-rank, and applies qr(). The augmented part of the 'Q' part is then the required basis. The function is.estble checks whether each row is orthogonal (within a tolerance) to every column of this null row basis. More details are provided in my previous wish-list submission, #15072. Please feel welcome to adapt this code to suit your needs.


x1 <- -4:4
x2 <- c(-2,1,-1,2,0,2,-1,1,-2)
x3 <- 3*x1 - 2*x2
x4 <- x2 - x1 + 4
y <- 1 + x1 + x2 + x3 + x4 + c(-.5,.5,.5,-.5,0,.5,-.5,-.5,.5)

##      x1 x2  x3 x4    y
## [1,] -4 -2  -8  6 -7.5
## [2,] -3  1 -11  8 -3.5
## [3,] -2 -1  -4  5 -0.5
## [4,] -1  2  -7  7  1.5
## [5,]  0  0   0  4  5.0
## [6,]  1  2  -1  5  8.5
## [7,]  2 -1   8  1 10.5
## [8,]  3  1   7  2 13.5
## [9,]  4 -2  16 -2 17.5
 # Fit a model
 mod1234 <- lm(y ~ x1 + x2 + x3 + x4)
 # test data:
testset <- data.frame( 
              x1 = c(3, 6, 6, 0, 0, 1), 
              x2 = c(1, 2, 2, 0, 0, 2), 
              x3 = c(7,14,14, 0, 0, 3), 
              x4 = c(2, 4, 0, 4, 0, 4))

# Create model matrix for newdata              
get.matrix <- function(model, newdata) {
    trms = delete.response(terms(model))
    m = model.frame(trms, newdata)
    model.matrix(trms, m)

N1234 <- lsmeans::nonest.basis(mod1234$qr)
X <- get.matrix(mod1234, testset)

lsmeans::is.estble(X, N1234)
##    1     2     3     4     5     6 
Comment 1 Russ Lenth 2015-02-06 22:31:48 UTC
Update: To make these better available for package developers, the functions nonest.basis and is.estble are now moved from *lsmeans* to the *estimability* package, available on CRAN.
Comment 2 David Firth 2015-02-17 11:49:43 UTC
In the hope that it's helpful to report here my view of this, having given this problem some thought after seeing it arise in applications:

I agree with Russ Lenth's analysis of this problem with predict.lm, and his suggested solution (which I have not checked in full detail, though).  The fact that output from predict.lm can be misleading is currently the subject of a warning to the user, but this seems less than ideal when the "misleading" cases can be detected and handled more informatively.  This probably isn't strictly a bug, since the current bahaviour is as documented.  But I think the statement in ?predict.lm that "That cannot be checked accurately" is too strong, and is not correct in the sense that /accurately enough/ ought to be achievable.  If a given "newdata" point is /very close/ to the subspace necessary for estimability, then continuity limits the extent to which the predict.lm output can be misleading.  On the other hand, newdata-values that are far from that subspace will usually yield (sometimes grossly) misleading output from predict.lm, and this can be reliably detected.