|Summary:||plot.lm cooks distrance contour|
|Product:||R||Reporter:||Randy Lai <randy.cs.lai>|
|Severity:||normal||CC:||k_bugzilla, maechler, pd.mes|
Description Randy Lai 2016-10-07 20:16:03 UTC
Created attachment 2170 [details] patch plot.lm lm.SR <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings) plot(lm.SR, which = 6) The plot shows that Japan has a standard residual between 0.5 and 1. But the actual value should be > rstandard(lm.SR)['Japan'] Japan 1.575955 It is caused by the incorrect formula used in computing the standard residuals. Check the attached patch.
Comment 1 Peter Dalgaard 2016-10-11 08:46:12 UTC
To be specific, the cause is that the divisor p in Cook's distance has been omitted. (The proportionality factor is rstd/p rather than rstd). The patch seems sensible. Side question: cooks.distance.lm has p <- model$rank plot.lm has p <- length(coef(x)) which won't normally differ, but can they differ? If they can, which one is correct? I think they ought to be consistent.
Comment 2 Martin Maechler 2016-10-11 16:05:01 UTC
(In reply to Peter Dalgaard from comment #1) > To be specific, the cause is that the divisor p in Cook's distance has been > omitted. (The proportionality factor is rstd/p rather than rstd). The patch > seems sensible. Thank you Peter, for checking. > > Side question: > > cooks.distance.lm has p <- model$rank > plot.lm has p <- length(coef(x)) > > which won't normally differ, but can they differ? If they can, which one is > correct? I think they ought to be consistent. Yes, they can differ quite a bit as soon as you have singularities. I'm pretty sure model$rank is correct not the least because that's also used by summary.lm(). That, BTW, returns a 'df' component which is (from help page): df: degrees of freedom, a 3-vector (p, n-p, p*), the first being the number of non-aliased coefficients, the last being the total number of coefficients. and p is rank whereas p* is length(coef(.)). As I am listed as one "perpetrator" co-authoring this monster-long R function, I will commit the change (mixed with another small influence() related potential speedup). Martin
Comment 3 k_bugzilla 2016-10-12 11:37:04 UTC
While at it, I am just chipping in here with a polite request to check. In summary.lm objects we have for the 'df' component two assignments: case p = 0: ans$df <- c(0L, n, length(ans$aliased)) any other p: ans$df <- c(p, rdf, NCOL(Qr$qr)) I believe, for p!=0, p* = NCOL(Qr$qr) gives the right answer as any variables belonging to aliased coefficients are kept in Qr$qr (same goes for p=0 with length(ans$aliased). Maybe we want to be a little bit more consistent here and have length(ans$aliased) for p!=0 here as well? It seems a little more expressive to me.
Comment 4 k_bugzilla 2016-10-12 11:43:59 UTC
Mhh, also in summary.lm we have for p!=0: n <- NROW(Qr$qr) and later again a call to ans$df <- c(p, rdf, NCOL(Qr$qr)) that 'n' seems to be used in the adjusted R-squared calculation: ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf) ...but should it be the the number of coefficients without the aliased ones?
Comment 5 k_bugzilla 2016-10-12 11:47:03 UTC
Ok, forget that last comment. Ver sorry, I got all mixed up by NCOL and NROW...