The help page for predict.lm states that in rank-deficient cases, it is not possible to accurately check for estimability of predictions. I am certain that this issue has been considered carefully, but nonetheless I have a suggestion for a way to do this that has worked well for me in quite a few tests (a couple of which are included in this posting). If you find this useful, my suggestion would be that predict.lm return NA (at least by default) for all non-estimable cases; the warning about rank deficiency would then be unnecessary.

The idea is that for the model y = X*b + e, the linear function a'b is non-estimable iff a has a nonzero component in the null space N(X) of X (the set of vectors a with X*a = 0). So, supposing X is n*p of rank r < p, let C be a p*(p-r) matrix whose columns form a basis for N(X). Then a'b is estimable iff C'a = 0. (I show how to get C below.) In practice, I'd suggest normalizing the columns of C, and declaring a'b non-estimable when max(abs(C'a))^2 > tol*a'a, where tol is the tolerance value used in the call to dqrdc2 (and available from the 'qr' member of an 'lm' object). My reasoning is that tol is used as a tolerance for a kind of squared relative norm of the columns of Q in the QR decomposition, if I understand the comments in dqrdc2.c correctly.

The other nuance is that this test must be applied BEFORE dropping the elements of the vector a that correspond to dropped columns of X (the NA elements of the coefficients). This is important because those rows of C may not be zero!

Here is an easy and efficient way to get the basis C. One can use the fact that if X = QR is the QR decomposition, then N(X) = N(R), and R is a nice, small, well-behaved matrix. We need only grab the R matrix, add a diagonal to its last p-r rows, and obtain the residuals from regressing those rows on the first r rows. Then normalize and unravel the pivoting, and you have C'.
This is a well-behaved operation because t(R) is lower triangular, with nonzero diagonal entries in its first r columns and zeros in its last p-r columns. So replacing the bottom-right corner with I_{p-r} is guaranteed to make its rank equal to p. Here is some R code that does the trick:

null.basis <- function(object) {
    tR <- t(qr.R(object$qr))   # transpose of R
    rank <- object$qr$rank
    if (rank == nrow(tR)) return(NULL)
    # the last p - r columns are zero -- add a diagonal
    for (i in (rank+1):nrow(tR)) tR[i,i] <- 1
    # project the last p - r columns on the orthogonal
    # complement of the first r columns
    null.basis <- qr.resid(qr(tR[, 1:rank]), tR[, -(1:rank)])
    # if only 1 column, we have to make it a matrix
    if (!is.matrix(null.basis)) null.basis <- matrix(null.basis, ncol=1)
    # permute the rows via pivot
    null.basis[object$qr$pivot, ] <- null.basis
    # normalize and return
    apply(null.basis, 2, function(x) x / sqrt(sum(x^2)))
}

... And an example:

R> warped.lm <- lm(breaks ~ wool*tension, data=warpbreaks, subset=10:45)
R> cbind(coef(warped.lm), null.basis(warped.lm))
                           14 15
(Intercept)    23.4444444  0.4472136 0
woolB           4.7777778 -0.4472136 0
tensionM        0.5555556 -0.4472136 0
tensionH        1.1111111 -0.4472136 0
woolB:tensionM         NA  0.4472136 0
woolB:tensionH         NA  0.0000000 1

Different parameterization of the same model:

R> warped.repar <- lm(breaks ~ wool*tension, data=warpbreaks, subset=10:45,
+     contrasts=list(wool="contr.sum", tension="contr.poly"))
R> zapsmall(cbind(coef(warped.repar), null.basis(warped.repar)))
                             14        15
(Intercept)     26.388889 -0.377964  0.000000
wool1           -2.388889  0.000000  0.377964
tension.L        0.785674  0.000000 -0.801784
tension.Q        0.000000 -0.462910  0.000000
wool1:tension.L        NA  0.801784  0.000000
wool1:tension.Q        NA  0.000000  0.462910

Finally, a demo function to test estimability:

can.predict <- function(object, newdata) {
    # Code borrowed from predict.lm
    Terms <- delete.response(terms(object))
    m <- model.frame(Terms, newdata, xlev = object$xlevels)
    if (!is.null(cl <- attr(Terms, "dataClasses")))
        .checkMFClasses(cl, m)
    XX <- model.matrix(Terms, m, contrasts.arg = object$contrasts)
    # Test each row of XX for estimability
    C <- null.basis(object)
    t(apply(XX, 1, function(a) {
        test <- max(abs(t(C) %*% a)) ^ 2 / sum(a^2)
        c(test = test, estble = (test < 1e-7))
    }))
}

( faccombs <- with(warpbreaks,
      expand.grid(wool=levels(wool), tension=levels(tension))) )
  wool tension
1    A       L
2    B       L
3    A       M
4    B       M
5    A       H
6    B       H

R> can.predict(warped.lm, faccombs)
          test estble
1 2.000000e-01      0
2 9.860761e-32      1
3 1.386670e-32      1
4 1.925930e-32      1
5 6.162976e-33      1
6 2.500000e-01      0

R> can.predict(warped.repar, faccombs)
          test estble
1 3.857143e-01      0
2 1.132447e-32      1
3 2.311116e-32      1
4 1.959490e-32      1
5 1.872004e-32      1
6 3.857143e-01      0

We (correctly) flag the two empty cells under both parameterizations.
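To connect this back to the suggestion at the top of the post, here is a minimal sketch of a wrapper that returns NA for non-estimable predictions. The name est.predict and the reuse of can.predict()'s 1e-7 cutoff are my own choices for illustration, not part of any proposed patch:

```r
# Sketch only: predictions with NA in non-estimable positions.
# Relies on can.predict() (and hence null.basis()) defined above.
est.predict <- function(object, newdata, ...) {
    est <- can.predict(object, newdata)
    pred <- suppressWarnings(predict(object, newdata, ...))
    pred[est[, "estble"] == 0] <- NA   # blank out non-estimable cases
    pred
}
```

For example, est.predict(warped.lm, faccombs) should return NA in positions 1 and 6, matching the estble column above.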

Created attachment 1599 [details]
Modification of predict.lm that tests estimability

A detailed explanation and some tests of this code are in the PDF attachment.

Created attachment 1600 [details]
Notations, discussion, and tests

This document explains the methods used (similar to the earlier post, but improved). It also shows some tests of my revised predict.lm method.

I submitted this as a wish some time ago, and now feel I can add more detail and a proposed modification of predict.lm that incorporates a test of estimability. I feel this would be an improvement over the current warning message that is displayed when predicting from a rank-deficient model with new data. The attached PDF file explains the method in detail and shows some tests. The method is a refinement of the one described originally. The gist of it is that in the QR decomposition of X, the row spaces of X and R are the same; so the null space of X can be determined from R, and in particular from the QR decomposition of R', after adding diagonal elements to make it full-rank. The PDF file discusses how this can be used to accurately assess estimability.
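For readers who do not want to open the attachment, the refined construction could be sketched roughly as follows. This is my reconstruction from the description above, so treat the name null.basis2 and the details as assumptions (in particular, it assumes the inner QR does no column pivoting, which it should not since the patched matrix has full rank); the actual proposed code is in the attachment:

```r
# Sketch of the refined method: obtain a basis for N(X) from the QR
# decomposition of R', after patching its diagonal to make it full rank.
null.basis2 <- function(object) {
    qrx <- object$qr
    r <- qrx$rank
    tR <- t(qr.R(qrx))                  # R', lower triangular
    p <- nrow(tR)
    if (r == p) return(NULL)            # full column rank: no null space
    for (i in (r + 1):p) tR[i, i] <- 1  # add diagonal elements
    # The first r columns of tR span the row space of X (in pivoted
    # order), so the last p - r columns of Q span the orthogonal
    # complement of that row space, i.e. N(X)
    B <- qr.Q(qr(tR))[, (r + 1):p, drop = FALSE]
    B[qrx$pivot, ] <- B                 # undo the column pivoting of X
    B
}
```

Up to sign and rotation of its columns, this should span the same space as the null.basis() function from the earlier post.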