Bug 17161 - plot.lm cooks distrance contour
Summary: plot.lm cooks distrance contour
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Analyses (show other bugs)
Version: R 3.3.*
Hardware: Other Other
: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-10-07 20:16 UTC by Randy Lai
Modified: 2016-10-12 11:47 UTC (History)
3 users (show)

See Also:


Attachments
patch plot.lm (736 bytes, patch)
2016-10-07 20:16 UTC, Randy Lai
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Randy Lai 2016-10-07 20:16:03 UTC
Created attachment 2170 [details]
patch plot.lm

lm.SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
plot(lm.SR, which = 6)

The plot shows that Japan has a standard residual between 0.5 and 1.

But the actual value should be
> rstandard(lm.SR)['Japan']
   Japan 
1.575955 


It is caused by the incorrect formula used in computing the standard residuals. Check the attached patch.
Comment 1 Peter Dalgaard 2016-10-11 08:46:12 UTC
To be specific, the cause is that the divisor p in Cook's distance has been omitted. (The proportionality factor is rstd/p rather than rstd). The patch seems sensible. 

Side question: 

cooks.distance.lm has p <- model$rank
plot.lm has p <- length(coef(x)) 

which won't normally differ, but can they differ? If they can, which one is correct? I think they ought to be consistent.
Comment 2 Martin Maechler 2016-10-11 16:05:01 UTC
(In reply to Peter Dalgaard from comment #1)
> To be specific, the cause is that the divisor p in Cook's distance has been
> omitted. (The proportionality factor is rstd/p rather than rstd). The patch
> seems sensible. 

Thank you Peter, for checking.
> 
> Side question: 
> 
> cooks.distance.lm has p <- model$rank
> plot.lm has p <- length(coef(x)) 
> 
> which won't normally differ, but can they differ? If they can, which one is
> correct? I think they ought to be consistent.

Yes, they can differ quite a bit as soon as you have singularities.
I'm pretty sure  model$rank is correct  not the least because that's also used by
summary.lm().

That, BTW, returns a 'df' component which is (from help page):

      df: degrees of freedom, a 3-vector (p, n-p, p*), the first being
          the number of non-aliased coefficients, the last being the
          total number of coefficients.

and p is rank  whereas  p* is length(coef(.)).

As I am listed as one "perpetrator" co-authoring this monster-long R function,
I will commit the change (mixed with another small influence() related potential speedup).

Martin
Comment 3 k_bugzilla 2016-10-12 11:37:04 UTC
While at it, I am just chipping in here with a polite request to check. In summary.lm objects we have for the 'df' component two assignments:

case p = 0:  ans$df <- c(0L, n, length(ans$aliased))
any other p: ans$df <- c(p, rdf, NCOL(Qr$qr))

I believe, for p!=0, p* = NCOL(Qr$qr) gives the right answer as any variables belonging to aliased coefficients are kept in Qr$qr (same goes for p=0 with length(ans$aliased). Maybe we want to be a little bit more consistent here and have length(ans$aliased) for p!=0 here as well? It seems a little more expressive to me.
Comment 4 k_bugzilla 2016-10-12 11:43:59 UTC
Mhh, also in summary.lm we have for p!=0:
 n <- NROW(Qr$qr)
and later again a call to
 ans$df <- c(p, rdf, NCOL(Qr$qr))

that 'n' seems to be used in the adjusted R-squared calculation:
 ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)

...but should it be the the number of coefficients without the aliased ones?
Comment 5 k_bugzilla 2016-10-12 11:47:03 UTC
Ok, forget that last comment. Ver sorry, I got all mixed up by NCOL and NROW...