Bug 8877 - predict.lm does not have a weights argument for newdata
Summary: predict.lm does not have a weights argument for newdata
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist (show other bugs)
Version: old
Hardware: All Linux
: P5 normal
Assignee: Jitterbug compatibility account
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-05-18 20:14 UTC by Jitterbug compatibility account
Modified: 2014-02-16 11:43 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jitterbug compatibility account 2006-05-18 20:14:23 UTC
From: jranke@uni-bremen.de
Full_Name: Johannes Ranke
Version: 2.3.0
OS: Linux-i386
Submission from: (NULL) (134.102.60.74)


In the case that predict.lm is used on an object resulting from weighted linear
regression with interval="prediction", the prediction intervals currently depend
on 
the absolute size of object$weights:

data(anscombe); attach(anscombe)
m1 <- lm(y1 ~ x1, w = rep(1,length(x1)))
m2 <- lm(y1 ~ x1, w = rep(3,length(x1)))
predict(m1, data.frame(x1 = 5), interval = "prediction")
predict(m2, data.frame(x1 = 5), interval = "prediction")

This make sense only if the weights are taken to be the numbers of replicates
used
for deriving the x values in newdata, and even then, the given prediction
interval
is only correct if the number of replicates for establishing the x values in
newdata is always one.

The desired behavior would be that the user gives a vector weights.newdata for
newdata, matching the weighting scheme applied for the weighted regression (e.g.
calculated from a variance function).

My education in statistics is only medium, so I am not sure if my proposed
solution is correct. Please check my patch to fix this:

--- lm.R.orig   2006-04-10 00:19:39.000000000 +0200
+++ lm.R        2006-05-18 17:10:29.000000000 +0200
@@ -591,7 +591,8 @@
     function(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
             interval = c("none", "confidence", "prediction"),
             level = .95,  type = c("response", "terms"),
-            terms = NULL, na.action = na.pass, ...)
+            terms = NULL, na.action = na.pass, ...,
+       weights.newdata = rep(1, length(newdata[[1]])))
 {
     tt <- terms(object)
     if(missing(newdata) || is.null(newdata)) {
@@ -630,6 +631,11 @@
                r <- object$residuals
                w <- object$weights
                rss <- sum(if(is.null(w)) r^2 else r^2 * w)
+    if(!is.null(w) && interval == "prediction" && weights.newdata ==
rep(1,length(newdata[[1]]))) {
+        warning(paste(
+          "To find prediction intervals from linear models with weights,",
+          "you probably want weights.newdata different from one."))
+    }
                df <- n - p
                rss/df
            } else scale^2
@@ -715,7 +721,7 @@
        tfrac <- qt((1 - level)/2, df)
        hwid <- tfrac * switch(interval,
                               confidence = sqrt(ip),
-                              prediction = sqrt(ip+res.var)
+                              prediction = sqrt(ip+res.var/weights.newdata)
                               )
        if(type != "terms") {
            predictor <- cbind(predictor, predictor + hwid %o% c(1, -1))

---------------------------end patch-------------------------------------------

if you apply this to lm.R, you get the same result from both lines:

predict(m1, data.frame(x1 = 5), interval = "prediction") 
predict(m2, data.frame(x1 = 5), interval = "prediction", weights.newdata = 3)

except that you get a warning if you set all newweights to one (default),
because this is probably not what you want for a prediction interval from
weighted regression.

All it does is to (down)scale the part of the variance that each new data point
contributes to the variance of the predicted y.
Comment 1 Jitterbug compatibility account 2006-05-21 01:50:54 UTC
From: Johannes Ranke <jranke@uni-bremen.de>
Dear R developers,

I am a little disappointed that my bug report only made it to the
wishlist, with the argument:

	Well, it does not say it has.
	Only relevant to prediction intervals.

predict.lm does calculate prediction intervals for linear models from 
weighted regression, so they should be correct, right? 

As far as I can see they are bound to be wrong in almost all cases, if
no weights for newdata can be given. So the point is that predict.lm
needs such an argument in order to give correct prediction intervals for
models from weighted linear regression.

Also, it strikes me that in the absence of a "newdata" argument, the
weights from the "lm" object need to be taken into account for
constructing prediction intervals.

My updated proposal fixing both points as well as the help file can be found at:

	http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.patch

and I wrote up a small demonstration of the problem and my proposed solution:

	http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.pdf

Kind regards,

Johannes Ranke

-- 
Dr. Johannes Ranke                 jranke@uni-bremen.de
UFT Bremen, Leobenerstr. 1         +49 421 218 8971 
D-28359 Bremen                     http://www.uft.uni-bremen.de/chemie/ranke
Comment 2 Jitterbug compatibility account 2006-05-22 20:37:40 UTC
From: Johannes Ranke <jranke@uni-bremen.de>
Dear R Developers,

Prediction intervals for weighted regression are treated in 

	Brown, P. J. (1994) Measurement, Regression and Calibration. Oxford

on p. 35. However, the prediction interval for Z is not really
spelled out in a way that would be accessible for me.

So I give up on this. I find my proposed solution (further updated since
my last mail) plausible, but I don't know and can't find enough theory
to support it.
	
	http://www.uft.uni-bremen.de/chemie/ranke/r-patches/predict.lm.patch

Ceterum censeo PR#8877 esse corrigendum.

Johannes Ranke

-- 
Dr. Johannes Ranke                 jranke@uni-bremen.de
UFT Bremen, Leobenerstr. 1         +49 421 218 8971 
D-28359 Bremen                     http://www.uft.uni-bremen.de/chemie/ranke
Comment 3 Jitterbug compatibility account 2006-05-24 10:34:40 UTC
From: Prof Brian Ripley <ripley@stats.ox.ac.uk>
I am more than 'a little disappointed' that you expect a detailed 
explanation of the problems with your 'bug' report, especially as you did 
not provide any explanation yourself as to your reasoning (nor did you 
provide any credentials nor references).

Note that

1) Your report did not make clear that this was only relevant to 
prediction intervals, which are not commonly used.

2) Only in some rather special circumstances do weights enter into 
prediction intervals, and definitely not necessarily the weights used for 
fitting.  Indeed, it seems that to label the variances that do enter as 
inverse weights would be rather misleading.

3) In a later message you referenced Brown's book, which is dealing with a 
different model.

The model fitted by lm is

 	y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2)

(Row vector x, column vector \beta.)

If the observations are from the model, OLS is appropriate, but weighting 
is used in several scenarios, including:

(a) case weights:  w_i = 3 means `I have three observations like (y, x)'

(b) inverse-variance weights, most often an indication that w_i = 1/3 
means that y_i is actually the average of 3 observations at x_i.

(c) multiple imputation, where a case with missing values in x is split 
into say 5 parts, with case weights less than and summing to one.

(d) Heteroscedasticity, where the model is rather

         y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))

And there may well be other scenarios, but those are the most common (in 
decreasing order) in my experience.


Now, consider prediction intervals.  It would be perverse to consider 
these to be for other than a single future observation at x.  In scenarios 
(a) to (c), R's current behaviour is what is commonly accepted to be 
correct (and you provide no arguments otherwise). If a future observation 
has missing values, predict.lm would only be a starting point for multiple 
imputation.

Even if 'newdata' is not supplied, prediction intervals must apply to new 
observations, not the existing ones (or the formula used is wrong: perhaps 
to avoid your confusion they should not be allowed in that case).

Only in case (d), which is a different model, is it appropriate to supply 
error variances (not weights) for prediction intervals.  This is why I 
marked it for the wishlist.  Equally, one might want to specify
\sigma^2 for all future observations as being different from the model 
fitting, as the training data may include other components of variance in 
their error variances.


On Sat, 20 May 2006, jranke@uni-bremen.de wrote:

> Dear R developers,
>
> I am a little disappointed that my bug report only made it to the
> wishlist, with the argument:
>
>       Well, it does not say it has.
>       Only relevant to prediction intervals.
>
> predict.lm does calculate prediction intervals for linear models from
> weighted regression, so they should be correct, right?
>
> As far as I can see they are bound to be wrong in almost all cases, if
> no weights for newdata can be given. So the point is that predict.lm
> needs such an argument in order to give correct prediction intervals for
> models from weighted linear regression.
>
> Also, it strikes me that in the absence of a "newdata" argument, the
> weights from the "lm" object need to be taken into account for
> constructing prediction intervals.

Where are the references and arguments?

> My updated proposal fixing both points as well as the help file can be found
> at:
>
>       http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.patch

Not found.

> and I wrote up a small demonstration of the problem and my proposed solution:
>
>       http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.pdf

That example is not a valid use of WLS, as you have the weights depending 
on the data you are fitting.

> Kind regards,
>
> Johannes Ranke
>
>

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Comment 4 Jitterbug compatibility account 2006-05-24 15:59:00 UTC
NOTES:
 Well, it does not say it has.
Only relevant to prediction intervals.
pred.var argument added for 2.4.0
Comment 5 Jitterbug compatibility account 2006-05-24 15:59:49 UTC
Audit (from Jitterbug):
Fri May 19 13:16:35 2006	ripley	changed notes
Fri May 19 11:16:35 2006	ripley	moved from incoming to wishlist
Wed May 24 12:59:49 2006	ripley	changed notes
Wed May 24 10:59:49 2006	ripley	moved from wishlist to wishlst-fulfilled
Comment 6 Jitterbug compatibility account 2006-05-25 03:52:32 UTC
From: Johannes Ranke <jranke@uni-bremen.de>
Prof. Ripley,

thank you for being more explicit now! Thank you also for the fix to
the wish that you derived from my bug report. Here comes the rationale
for my updated patch, which I *humbly* propose as a more general
solution to the problem.

I have spent several days now on this problem, but I am no statistician,
so please excuse my ignorance of any notational or other conventions
that I might disregard below. I tried hard to do it right.

Judging from the documentation to lm, I find that lm has only *one*
perspective on weights which I propose is reasonable and sufficient. It
minimises  "sum(w*e^2))", more clearly expressed as 

	sum(w_i * e_i^2)

I infer that the point is to construct the weights such that 
the errors

	\epsilon = sqrt(w_k) * \epsilon_k 
	
are from a normal distribution N(0, \sigma^2), where index k covers all
observations, past as well as future ones, the model is

	y = x\beta + \epsilon_k

with 

	\epsilon_k \sim N(0, \sigma_k^2)
	
and

	\sigma_k^2 = \sigma^2 / w_k

This is my view on this, it might be naive, it might be wrong, but
if so, I can't see my mistake.

An estimator for \sigma^2 is then

	sum(w_i * e_i^2) / df

which is called res.var in the R code to predict.lm. An estimator
for \sigma_k^2, the variance for observation k, is therefore

	res.var / w_k

which is what my proposed patch, which can now be found in an updated
form under

	 http://www.uft.uni-bremen.de/chemie/ranke/r-patches/predict.lm.patch.r38195

implements. I removed the first version called lm.predict.patch because
it did not correctly deal with prediction intervals for old data, sorry
for the inconvenience ("Not found"). Meanwhile you disabled prediction
intervals for old data, which the above patch reverts (no offense,
please, I just think if predict.lm does confidence intervals for the
regression line at the location of old data points, it might as well do
illustrative prediction intervals at these locations. At least for the
old data, the weights are known from the model object.)

I tested my solution with the script

	http://www.uft.uni-bremen.de/chemie/ranke/r-patches/test.predict.lm.R

* Prof Brian Ripley <ripley@stats.ox.ac.uk> [060524 07:50]:
> I am more than 'a little disappointed' that you expect a detailed 
> explanation of the problems with your 'bug' report, especially as you did 
> not provide any explanation yourself as to your reasoning (nor did you 
> provide any credentials nor references).

I am sorry for this, and I worked hard to supply the information in the
followups including this one.
 
> Note that
> 
> 1) Your report did not make clear that this was only relevant to 
> prediction intervals, which are not commonly used.

I am not really sure if confidence intervals couldn't be improved
in scenario (d). 
 
> 2) Only in some rather special circumstances do weights enter into 
> prediction intervals, and definitely not necessarily the weights used for 
> fitting.  

Yes, under these circumstances R would then need weights from the user
in order to construct prediction intervals. 

> Indeed, it seems that to label the variances that do enter as 
> inverse weights would be rather misleading.

Possibly. I assume it can be correctly done and documented (see my patch
for a proposal).

> 3) In a later message you referenced Brown's book, which is dealing with a 
> different model.
> 
> The model fitted by lm is
> 
>       y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2)
> 
> (Row vector x, column vector \beta.)

I assumed that the model in Brown's book is a special case of the model
fitted by lm, the only difference being that x is a row vector (1,
x_B), where x_B is Brown's random variable X, and \beta contains
\alpha_B and \beta_B.
 
> If the observations are from the model, OLS is appropriate, but weighting 
> is used in several scenarios, including:
> 
> (a) case weights:  w_i = 3 means `I have three observations like (y, x)'

I am not sure what you mean by this. Do you mean you have three exactly
equal observations? In this case this could be regarded as a special
case of (b) and w_i = 1/3 would be used as input for lm.

What does the "'" mean in (y, x)'?

> (b) inverse-variance weights, most often an indication that w_i = 1/3 
> means that y_i is actually the average of 3 observations at x_i.

Yes, and I take the reasoning for this to be as follows: An estimator
for the variance of the means of n observations is 1/n times the
variance of the single observations. Why shouldn't this apply to the
means of multiple future observations?

> (c) multiple imputation, where a case with missing values in x is split 
> into say 5 parts, with case weights less than and summing to one.

I don't understand this, but I suppose lm works in the same way for
weights w_i derived from such a procedure.
 
> (d) Heteroscedasticity, where the model is rather
> 
>         y = x\beta + \epsilon, \epsilon \sim N(0, \sigma^2(x))
> 
> And there may well be other scenarios, but those are the most common (in 
> decreasing order) in my experience.

Let me illustrate what my proposal would mean for the different
scenarios:

For scenario (b) we have pairs of n_i and x_i, where n_i is the
number of observations made on case i. I postulate that w_k = 1/n_k,
where n_k is the number of replicate measurements done at sample k, be
it for old or for new data. I don't see what's perverse in assuming 
n_j > 1 for future observations j.

pred.var which you introduced becomes pred.var/n_k, which is equal to
res.var if n_k = 1. If the user has the chance to use weights.newdata,
he can give 1/n_k explicitly and arrive at the correct estimate for
\sigma_k^2 even without knowing the estimate for \sigma^2.

For scenario (d), we can write

	\sigma_k^2 = \sigma^2 / w(x)

with w(x) = w_k. This means, if the weights w_i were calculated using
a function w(x), the same function would need to be applied to get
the weights for newdata w_j = w(x_j).

> Now, consider prediction intervals. 

Yes.

> It would be perverse to consider these to be for other than a single
> future observation at x. 

I can't follow you here. If the weights are 1/n_i and y_i are means from n_i 
observations as in scenario (b), it is of practical interest to be able to
give prediction intervals for the means of n_j new observations on cases j.
I can give an example, if you like.

> In scenarios 
> (a) to (c), R's current behaviour is what is commonly accepted to be 
> correct (and you provide no arguments otherwise). 

I don't see how weights according to (a) would give meaningful answers,
except if they are inversed as I propose above. 

I see that R provides meaningful results for scenario (b), given
prediction intervals are not sought for means from multiple observations.

For scenario (c), I suspect that w_j for future observations should be
mean(w_i), except if you have downweighted cases by w_i that are not
expected to happen in future observations any more. In that case, w_j 
would be 1, which is equivalent to the behaviour of R with or without 
my patch.

> If a future observation 
> has missing values, predict.lm would only be a starting point for multiple 
> imputation.

I can't follow you here.
 
> Even if 'newdata' is not supplied, prediction intervals must apply to new 
> observations, not the existing ones (or the formula used is wrong:

I think that the formula used in the R implementation is wrong, since
the manual says it gives prediction intervals for the existing
observations, so their weights have to be considered in order to arrive
at the correct variance \sigma_k^2.

> perhaps to avoid your confusion they should not be allowed in that
> case).

It is not necessary to disallow them.

> Only in case (d), which is a different model, 

I strongly object. According to the documentation, there is only one
way of treating weighted regression in lm. And I think this is 
reasonable.

> is it appropriate to supply 
> error variances (not weights) for prediction intervals.  

This possibility is conserved applying my patch proposed for r38195.

However, if w(x) is known as in case (d), it is easier to give w(x_j)
for newdata x_j.

At current, the default values for the variances for new data are taken
to be all equal and are defined by 

	res.var <-
	    if (is.null(scale)) {
		r <- object$residuals
		w <- object$weights
		rss <- sum(if(is.null(w)) r^2 else r^2 * w)
		df <- n - p
		rss/df
	    } else scale^2

which becomes pred.var by your latest change.

This means that sum(r^2 * w) / df is used as an estimator for the
variance of future observations \sigma_k^2. However, it follows from the
above that pred.var should be calculated from

	pred.var = res.var / weights.newdata

which forms the basis for my updated patch.

> This is why I 
> marked it for the wishlist.  Equally, one might want to specify
> \sigma^2 for all future observations as being different from the model 
> fitting, as the training data may include other components of variance in 
> their error variances.
> 
> 
> On Sat, 20 May 2006, jranke@uni-bremen.de wrote:
> 
> >Dear R developers,
> >
> >I am a little disappointed that my bug report only made it to the
> >wishlist, with the argument:
> >
> >     Well, it does not say it has.
> >     Only relevant to prediction intervals.
> >
> >predict.lm does calculate prediction intervals for linear models from
> >weighted regression, so they should be correct, right?
> >
> >As far as I can see they are bound to be wrong in almost all cases, if
> >no weights for newdata can be given. So the point is that predict.lm
> >needs such an argument in order to give correct prediction intervals for
> >models from weighted linear regression.
> >
> >Also, it strikes me that in the absence of a "newdata" argument, the
> >weights from the "lm" object need to be taken into account for
> >constructing prediction intervals.
> 
> Where are the references and arguments?
> 
> >My updated proposal fixing both points as well as the help file can be 
> >found at:
> >
> >     http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.patch
> 
> Not found.
> 
> >and I wrote up a small demonstration of the problem and my proposed 
> >solution:
> >
> >     http://www.uft.uni-bremen.de/chemie/ranke/r-patches/lm.predict.pdf
> 
> That example is not a valid use of WLS, as you have the weights depending 
> on the data you are fitting.

I don't understand. I estimated the variance function from the data,
what's wrong with that?

Thank you for your attention!

Johannes Ranke
 
> >Kind regards,
> >
> >Johannes Ranke
> >
> >
> 
> -- 
> Brian D. Ripley,                  ripley@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-- 
Dr. Johannes Ranke                 jranke@uni-bremen.de
UFT Bremen, Leobenerstr. 1         +49 421 218 8971 
D-28359 Bremen                     http://www.uft.uni-bremen.de/chemie/ranke
Comment 7 Jackie Rosen 2014-02-16 11:43:48 UTC
(spam comment removed)