Bug 17415 - simulate.lm uses wrong variance when link != identity
Summary: simulate.lm uses wrong variance when link != identity
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Models (show other bugs)
Version: R-devel (trunk)
Hardware: All All
: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2018-04-28 16:31 UTC by Alexandre Courtiol
Modified: 2018-05-03 13:08 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexandre Courtiol 2018-04-28 16:31:55 UTC
The function simulate.lm() uses wrong variance is some case (e.g. gaussian(link = "log"), but also perhaps when one uses custom prior weights in all lm and glm), which lead to the incorrect simulation.

The bug applies to all versions of R since 2009 (so since I think R 2.8)!!

Minimal reproducible example:

set.seed(1L)
var(y <- rnorm(n = 1000, mean = 10, sd = sqrt(10)))
## 10.71051 --> OK

mod_glm  <- glm(y ~ 1, family = gaussian(link = "log"))
var(simulate(mod_glm)[, 1])
## 0.1166989 --> Wrong: it is way too far from variance in the original data

var(simulate.lm_patched(mod_glm)[, 1]) ## see below
## 11.38721 --> Seems correct

Bug origin:

It has been introduced in commit "update to similate.lm" on the 18/02/2009 by Brian Ripley git-svn-id: https://svn.r-project.org/R/trunk@47956 00db46b3-68df-0310-9c12-caf00c1e9a41

In the definition of simulate.lm() in src/library/stats/R/lm.R

The line

if (!is.null(object$weights)) vars <- vars/object$weights

should instead, very likely be:

if (!is.null(object$prior.weights)) vars <- vars/object$prior.weights

(which is how I create the function simulate.lm_patched used above)

The confusion between "prior.weights" and "weights" is clear in the history of the development of this function:

In commit by Martin Maechler "first cut at simulate(<glm>)" of 13/02/2009
git-svn-id: https://svn.r-project.org/R/trunk@47919 00db46b3-68df-0310-9c12-caf00c1e9a41 , we can see that he specifically used prior weights for some families before Ripley generalized the code:

"binomial" = {
 wts <- object$prior.weights
 if (any(wts %% 1 != 0))
 stop("cannot simulate from non-integer prior.weights")
 rbinom(ntot, size = wts, prob = ftd)/wts
}

The confusion is not surprising because prior weights are called weights in the arguments of the lm call but prior.weights internally, while what is called weights internally is something else (working weights).

Note:

The bug has remained unnoticed because when gaussian(link = "identity") and when the prior weights are all set to 1 (most common situation) then there is no issue.

It would be good if someone that knowns glm better than I do could make sure that the patch is correct.

Best,

Alex
Comment 1 Martin Maechler 2018-05-02 16:40:10 UTC
Thank you, Alex.

I agree this is a bug, and your correction seems fine.
I'm running checks now with a patch (including regression check).
Comment 2 Martin Maechler 2018-05-02 18:29:57 UTC
(In reply to Martin Maechler from comment #1)
 
> ... and your correction seems fine.

actually not: Need to consider *both* $weights and $prior.weights .. and of course it is obvious how
Comment 3 Martin Maechler 2018-05-03 07:45:11 UTC
Required change is only obvious in hindsight:
An lm() and a glm(<gaussian>) must be treated separately here, the first using `weights`, the latter `prior.weights`.

Fixed in R-devel (svn 74683).   
Planned to port to R 3.5.0 patched after a waiting time.
Comment 4 Alexandre Courtiol 2018-05-03 13:08:34 UTC
Thanks a lot Martin for fixing this bug!