Created attachment 2170 [details]
lm.SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
plot(lm.SR, which = 6)
The plot shows that Japan has a standard residual between 0.5 and 1.
But the actual value should be
It is caused by the incorrect formula used in computing the standard residuals. Check the attached patch.
To be specific, the cause is that the divisor p in Cook's distance has been omitted. (The proportionality factor is rstd/p rather than rstd). The patch seems sensible.
cooks.distance.lm has p <- model$rank
plot.lm has p <- length(coef(x))
which won't normally differ, but can they differ? If they can, which one is correct? I think they ought to be consistent.
(In reply to Peter Dalgaard from comment #1)
> To be specific, the cause is that the divisor p in Cook's distance has been
> omitted. (The proportionality factor is rstd/p rather than rstd). The patch
> seems sensible.
Thank you Peter, for checking.
> Side question:
> cooks.distance.lm has p <- model$rank
> plot.lm has p <- length(coef(x))
> which won't normally differ, but can they differ? If they can, which one is
> correct? I think they ought to be consistent.
Yes, they can differ quite a bit as soon as you have singularities.
I'm pretty sure model$rank is correct not the least because that's also used by
That, BTW, returns a 'df' component which is (from help page):
df: degrees of freedom, a 3-vector (p, n-p, p*), the first being
the number of non-aliased coefficients, the last being the
total number of coefficients.
and p is rank whereas p* is length(coef(.)).
As I am listed as one "perpetrator" co-authoring this monster-long R function,
I will commit the change (mixed with another small influence() related potential speedup).
While at it, I am just chipping in here with a polite request to check. In summary.lm objects we have for the 'df' component two assignments:
case p = 0: ans$df <- c(0L, n, length(ans$aliased))
any other p: ans$df <- c(p, rdf, NCOL(Qr$qr))
I believe, for p!=0, p* = NCOL(Qr$qr) gives the right answer as any variables belonging to aliased coefficients are kept in Qr$qr (same goes for p=0 with length(ans$aliased). Maybe we want to be a little bit more consistent here and have length(ans$aliased) for p!=0 here as well? It seems a little more expressive to me.
Mhh, also in summary.lm we have for p!=0:
n <- NROW(Qr$qr)
and later again a call to
ans$df <- c(p, rdf, NCOL(Qr$qr))
that 'n' seems to be used in the adjusted R-squared calculation:
ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)
...but should it be the the number of coefficients without the aliased ones?
Ok, forget that last comment. Ver sorry, I got all mixed up by NCOL and NROW...