Bug 16070 - terms function mishandling case when intercept is excluded
Summary: terms function mishandling case when intercept is excluded
Status: REOPENED
Alias: None
Product: R
Classification: Unclassified
Component: Models (show other bugs)
Version: R 3.1.1
Hardware: Other Other
: P5 major
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2014-11-11 16:46 UTC by Pat O'Reilly
Modified: 2014-11-21 22:05 UTC (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Pat O'Reilly 2014-11-11 16:46:50 UTC
Hi, 

R documentation describes the factors attribute of the terms.object as follows: 

A matrix of variables by terms showing which variables appear in which terms. 
The entries are 0 if the variable does not occur in the term, 1 if it does occur 
and should be coded by contrasts, and 2 if it occurs and should be coded via 
dummy variables for all levels (as when an intercept or lower-order term is 
missing). If there are no terms other than an intercept and offsets, this is 
numeric(0). 
(http://stat.ethz.ch/R-manual/R-patched/library/stats/html/terms.object.html) 

In the example below, I would expect Species to have a value of 2 since the 
intercept is omitted. Indeed, when using model.matrix it is clear that Species 
has been coded with dummy variables for all three levels. 

f <- ~ -1 + Species 

attr(terms(f, data=iris), "factors") 
#        Species 
#Species       1 

levels(iris$Species) 
#[1] "setosa"     "versicolor" "virginica" 

colnames(model.matrix(f, iris)) 
#[1] "Speciessetosa"     "Speciesversicolor" "Speciesvirginica" 


I think this is a bug in the terms function.

Many thanks in advance, 

Pat O'Reilly
Comment 1 Duncan Murdoch 2014-11-13 15:07:41 UTC
Do you know of any model fits where the wrong terms are used because of this?  Maybe it's just a documentation error.
Comment 2 Peter Dalgaard 2014-11-13 16:05:06 UTC
As far as I can tell, the documentation has the right intention. However, I don't think the attribute is ever used except in interaction terms --- for main effects, one just looks at whether the model contains an intercept. I haven't actually checked the code, though.

Assuming that I am right, matching the documentation to the code is probably possible ("2 if it occurs in an interaction term and...") but it might be clearer to fix the code.
Comment 3 Pat O'Reilly 2014-11-13 17:18:36 UTC
I've not seen this affect the fitting of models but I intend to use the terms object as a source of metadata for a model matrix. If the factors attribute is correct, one can easily work from left to right through the factors matrix to determine what each column of the associated model matrix means. For example column 1 being Age interacted with Gender = "Male". If the factor matrix is incorrect then this becomes challenging.
Comment 4 Duncan Murdoch 2014-11-14 18:24:17 UTC
The calculation for this is in function TermCode in model.c in the stats package.
It says that the calculation follows the heuristic on p. 38 of the white book, 
but on checking, I see that it doesn't -- the case of no intercept is described there and our code doesn't do what it says.  I'll fix it.
Comment 5 Duncan Murdoch 2014-11-14 20:32:39 UTC
Now fixed in R-devel, soon in R-patched.
Comment 6 Duncan Murdoch 2014-11-16 13:35:25 UTC
This change caused several packages to fail their tests, and has been backed out.  We're discussing what to do next.
Comment 7 Pat O'Reilly 2014-11-21 13:37:14 UTC
Duncan,

I tried your fix and I think it is incorrect.

    f <- ~ -1 + Petal.Length*Species

    attr(terms(f, data=iris), "factors")
    #             Petal.Length Species Petal.Length:Species
    #Petal.Length            2       0                    1
    #Species                 0       1                    1

Since Species is the first factor, this needs to be coded with dummy variables.


The modelmatrix function works because although it uses a terms object to build the design matrix, it fixes the factors pattern matrix first.

Line 451 in model.c:

    /* If there is no intercept we look through the factor pattern */
    /* matrix and adjust the code for the first factor found so that */
    /* it will be coded by dummy variables rather than contrasts. */

    if (!intrcept) {
        for (j = 0; j < nterms; j++) {
            for (i = risponse; i < nVar; i++) {
                if (INTEGER(nlevs)[i] > 1
                && INTEGER(factors)[i + j * nVar] > 0) {
                    INTEGER(factors)[i + j * nVar] = 2;
                    goto alldone;
                }
            }
        }
    }
Comment 8 Duncan Murdoch 2014-11-21 19:44:20 UTC
Pat:  What change did you try?  Currently there are no changes to the code in R-devel or R-patched, as noted in comment 6.  R-devel currently has updated documentation to explain the current behaviour, but the behaviour hasn't changed from earlier code.
Comment 9 Pat O'Reilly 2014-11-21 22:05:22 UTC
Duncan, I found your fix on github:

https://github.com/wch/r-source/blob/5722f120a941741a710ed965d90ddb41d03db9d2/src/library/stats/src/model.c

Would it be a correct assessment to say that the no-intercept adjustment currently in the modelmatrix function should really be in either termcode or termsform?