Bug 15072 - Assessing estimability in predict.lm in rank-deficient situations
Summary: Assessing estimability in predict.lm in rank-deficient situations
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: Models
Version: R 3.1.0
Hardware: All
OS: All
Importance: P5 enhancement
Assignee: R-core
Depends on:
Reported: 2012-10-12 21:30 UTC by Russ Lenth
Modified: 2014-05-06 15:42 UTC
CC List: 0 users

See Also:

Modification of predict.lm that tests estimability (8.77 KB, text/plain)
2014-05-06 15:31 UTC, Russ Lenth
Notations, discussion, and tests (194.66 KB, application/pdf)
2014-05-06 15:34 UTC, Russ Lenth

Description Russ Lenth 2012-10-12 21:30:04 UTC
The help page for predict.lm states that in rank-deficient cases, it is not
possible to accurately check for estimability of predictions. I am certain that
this issue has been considered carefully, but nonetheless, I have a suggestion
for a way to do this that has worked well for me in quite a few tests (and a couple are included in this posting).

If you find this useful, my suggestion would be that predict.lm return NA (at least by default) for all non-estimable cases; the warning about rank deficiency would then be unnecessary.

The idea is that for the model y = X*b + e, the linear function a'b is estimable
iff a lies in the row space of X, i.e. iff a is orthogonal to the null space N(X) of X.

So, supposing X is n*p of rank r < p, let C be a p*(p-r) matrix whose columns form
a basis for N(X). Then a'b is estimable iff C'a = 0. (I show how to get C
below.) In practice, I'd suggest normalizing the columns of C, and declaring a'b
non-estimable when max(abs(C'a))^2 > tol*a'a, where tol is the tolerance value used
in the call to dqrdc2 (and available from the 'qr' member of an 'lm' object). My
reasoning is that tol is used as a tolerance for a kind of a squared relative
norm of the columns of Q in the QR decomposition, if I understand the comments
correctly in dqrdc2.c.
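As a small self-contained sketch of this criterion (the helper name is.estimable and the toy C here are my illustration, not part of the proposed patch):

```r
# Hypothetical helper illustrating the criterion:
# a'b is flagged non-estimable when max(|C'a|)^2 > tol * a'a.
is.estimable <- function(a, C, tol = 1e-7)
    max(abs(crossprod(C, a)))^2 <= tol * sum(a^2)

# Toy case: X has two identical columns, so N(X) is spanned by (1,-1)/sqrt(2)
C <- matrix(c(1, -1) / sqrt(2), ncol = 1)
is.estimable(c(1, 1), C)   # TRUE:  b1 + b2 is estimable
is.estimable(c(1, 0), C)   # FALSE: b1 alone is not
```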

The other nuance here is that this must be checked BEFORE dropping the elements
of the vector a that correspond to dropped columns of X (NA elements of the
coefficients). This is important because those rows of C may not be zero!

Here is an easy and efficient way to get the basis C...
One can use the fact that if X = QR is the QR decomposition, then N(X) = N(R), and
R is a nice small, well-behaved matrix. We need only grab the R matrix, add a
unit diagonal to its last p-r rows, transpose it, and obtain the residuals from
regressing the last p-r columns of R' on its first r columns. Then normalize,
undo the pivoting, and you have C. This is a well-behaved operation because R'
is lower triangular, with its first r columns spanning the row space of R and
its last p-r columns zero. So replacing the bottom corner with I_{p-r} is
guaranteed to make the rank equal to p.
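A quick numeric check of the fact that N(X) = N(R), using a made-up rank-deficient X (note that the pivoting must be applied to the vector before multiplying by R):

```r
set.seed(1)
X <- matrix(rnorm(30), 10, 3)
X <- cbind(X, X[, 1] + X[, 2])      # 4th column = col1 + col2, so rank is 3
qrX <- qr(X)
R <- qr.R(qrX)
v <- c(1, 1, 0, -1)                 # a vector in N(X)
max(abs(X %*% v))                   # ~ 0
max(abs(R %*% v[qrX$pivot]))        # also ~ 0: the (pivoted) v is in N(R) too
```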

Here is some R code that does the trick:
null.basis <- function(object) {
    tR <- t(qr.R(object$qr))  # transpose of R
    rank <- object$qr$rank
    # last few columns of tR are zero -- add a unit diagonal there
    if (rank < nrow(tR))
        for (i in (rank+1):nrow(tR)) tR[i, i] <- 1
    # project the last p - r columns onto the orthogonal complement of the first r columns
    null.basis <- qr.resid(qr(tR[, 1:rank]), tR[, -(1:rank)])
    # if only 1 column, we have to make it a matrix
    if (!is.matrix(null.basis)) null.basis <- matrix(null.basis, ncol = 1)
    # permute the rows via pivot
    null.basis[object$qr$pivot, ] <- null.basis
    # normalize and return
    apply(null.basis, 2, function(x) x / sqrt(sum(x^2)))
}

... And an example:
R> warped.lm <- lm(breaks ~ wool*tension, data=warpbreaks, subset=10:45)
R> cbind(coef(warped.lm), null.basis(warped.lm))
                                  14 15
(Intercept)    23.4444444  0.4472136  0
woolB           4.7777778 -0.4472136  0
tensionM        0.5555556 -0.4472136  0
tensionH        1.1111111 -0.4472136  0
woolB:tensionM         NA  0.4472136  0
woolB:tensionH         NA  0.0000000  1

Different parameterization of the same model:
R> warped.repar <- lm(breaks ~ wool*tension, data=warpbreaks, subset=10:45, 
+     contrasts=list(wool="contr.sum",tension="contr.poly"))
R> zapsmall(cbind(coef(warped.repar), null.basis(warped.repar)))
                                 14        15
(Intercept)     26.388889 -0.377964  0.000000
wool1           -2.388889  0.000000  0.377964
tension.L        0.785674  0.000000 -0.801784
tension.Q        0.000000 -0.462910  0.000000
wool1:tension.L        NA  0.801784  0.000000
wool1:tension.Q        NA  0.000000  0.462910

Finally, a demo function to test estimability:

can.predict <- function(object, newdata) {
    # Code borrowed from predict.lm
    Terms <- delete.response(terms(object))
    m <- model.frame(Terms, newdata, xlev = object$xlevels)
    if (!is.null(cl <- attr(Terms, "dataClasses"))) 
        .checkMFClasses(cl, m)
    XX <- model.matrix(Terms, m, contrasts.arg = object$contrasts)
    # Test each row of XX for estimability
    C <- null.basis(object)
    t(apply(XX, 1, function(a) {
        test <- max(abs(t(C) %*% a))^2 / sum(a^2)
        c(test = test, estble = (test < 1e-7))
    }))
}

R> ( faccombs <- with(warpbreaks, 
+     expand.grid(wool=levels(wool), tension=levels(tension))) )
  wool tension
1    A       L
2    B       L
3    A       M
4    B       M
5    A       H
6    B       H

R> can.predict(warped.lm, faccombs)
          test estble
1 2.000000e-01      0
2 9.860761e-32      1
3 1.386670e-32      1
4 1.925930e-32      1
5 6.162976e-33      1
6 2.500000e-01      0

R> can.predict(warped.repar, faccombs)
          test estble
1 3.857143e-01      0
2 1.132447e-32      1
3 2.311116e-32      1
4 1.959490e-32      1
5 1.872004e-32      1
6 3.857143e-01      0

We (correctly) flag the two empty cells with both parameterizations.
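For comparison, the current predict.lm returns a value for every row of the same new data, with only a blanket warning and no indication of which cells are affected (repeating the fit from above so the snippet stands alone):

```r
warped.lm <- lm(breaks ~ wool * tension, data = warpbreaks, subset = 10:45)
faccombs <- with(warpbreaks,
    expand.grid(wool = levels(wool), tension = levels(tension)))
p <- predict(warped.lm, newdata = faccombs)
# Warning message:
#   prediction from a rank-deficient fit may be misleading
length(p)   # 6 -- values are returned even for the two non-estimable cells
```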
Comment 1 Russ Lenth 2014-05-06 15:31:37 UTC
Created attachment 1599 [details]
Modification of predict.lm that tests estimability

Detailed explanation and some tests of this code are in the PDF attachment
Comment 2 Russ Lenth 2014-05-06 15:34:00 UTC
Created attachment 1600 [details]
Notations, discussion, and tests

This document explains the methods used (similar to earlier post, but improved). It also shows some tests of my revised predict.lm method.
Comment 3 Russ Lenth 2014-05-06 15:40:53 UTC
I submitted this as a wish some time ago, and now feel I can add more details and a proposed modification of predict.lm that incorporates testing of estimability. I feel this would be an improvement over the current warning message displayed when predicting from a rank-deficient model with new data.

The attached PDF file explains the method in detail and shows some tests. The method is a refinement of the one described originally. The gist of it is that in the QR decomposition of X, the row spaces of X and R are the same; so the null space of X can be determined using R, and in particular, from the QR decomposition of R' after adding diagonal elements to make it full-rank. The PDF file discusses how this can be used to accurately assess estimability.
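A sketch of that refined idea in base R (the function name nbasis and the details are my illustration here, not the exact code in the attachment): complete R' to full rank with a unit diagonal, take its QR decomposition, and read a null-space basis off the trailing columns of Q. This assumes qr() does not pivot the completed (full-rank) R', which holds for a well-conditioned triangular matrix.

```r
nbasis <- function(lmobj) {
    qrX <- lmobj$qr
    r <- qrX$rank
    tR <- t(qr.R(qrX))                    # R', lower triangular, p x p
    p <- nrow(tR)
    if (r == p) return(matrix(0, p, 0))   # full rank: everything is estimable
    for (i in (r + 1):p) tR[i, i] <- 1    # complete R' to full rank
    # trailing p - r columns of Q span the orthogonal complement of row(X)
    B <- qr.Q(qr(tR))[, (r + 1):p, drop = FALSE]
    B[qrX$pivot, ] <- B                   # undo the pivoting
    B
}

# Check on the example above: every column of B is annihilated by X
m <- lm(breaks ~ wool * tension, data = warpbreaks, subset = 10:45)
max(abs(model.matrix(m) %*% nbasis(m)))   # ~ 0
```

The columns of Q are already orthonormal, so no separate normalization step is needed.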