Bug 16278 - nested arima model has higher log-likelihood (using REML instead of ML?)
Summary: nested arima model has higher log-likelihood (using REML instead of ML?)
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Models (show other bugs)
Version: R 3.1.2
Hardware: x86_64/x64/amd64 (64-bit) OS X Mavericks
: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-03-23 19:19 UTC by Patrick Perry
Modified: 2015-12-14 13:48 UTC (History)
1 user (show)

See Also:


Attachments
R code to reproduce behavior (983 bytes, text/plain)
2015-03-23 19:19 UTC, Patrick Perry
Details
Data file (3.14 KB, text/csv)
2015-03-23 19:20 UTC, Patrick Perry
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Patrick Perry 2015-03-23 19:19:49 UTC
Created attachment 1759 [details]
R code to reproduce behavior

I have found (with Rob Hyndman's help) what I believe to be a bug in the "arima" function.

I am using this function to fit a series of ARIMA(p,1,q) models, both with and without drift, and then using the corrected AIC to select p, q, and whether to include a drift.  Nested models should never have higher log-likelihoods.  However, the log-likelihood values computed by the "arima" function do not have this property.  This can lead to a sub-optimal model being selected.  I think the problem is that the arima function is using a REML estimate for the variance parameter, even when method = "ML" is specified.

See the example below.  I have also attached the source code to reproduce this behavior.

--


# read in a time series (log of Google daily closing price from May 2, 2005
# to December 29, 2006):
log.GOOG <- log(ts(read.csv("http://ptrckprry.com/course/forecasting/data/google.csv")[,1]))
time <- seq_along(log.GOOG)


# Fit two models, one with drift (fit1), and one without (fit0).
# The log-likelihood for the model with drift (fit1) should be higher, but it is not.

(fit0 <- arima(log.GOOG, c(0, 1, 0), method="ML"))
##
## Call:
## arima(x = log.GOOG, order = c(0, 1, 0), method = "ML")
##
##
## sigma^2 estimated as 0.0004203: log likelihood = 1036.72, aic = -2071.45


(fit1 <- arima(log.GOOG, c(0, 1, 0), xreg=time, method="ML"))
##
## Call:
## arima(x = log.GOOG, order = c(0, 1, 0), xreg = time, method = "ML")
##
## Coefficients:
##         time
##       0.0017
## s.e.  0.0010
## sigma^2 estimated as 0.0004182: log likelihood = 1035.76, aic = -2067.52


# This is not a problem with the optimization routine:
drift <- seq(-0.01, 0.01, length=200)
loglik <- sapply(drift, function(d)
    arima(log.GOOG, c(0, 1, 0), xreg=time, method="ML", fixed=d)$loglik)
plot(drift, loglik, type="l")
abline(h=fit1$loglik)


# I think the problem is that the arima function is using a REML estimate for
# the variance:
n.used <- length(diff(log.GOOG))

var(diff(log.GOOG)) # REML
## [1] 0.0004182413

var(diff(log.GOOG)) * (n.used - 1) / n.used # ML
## [1] 0.0004172454

fit1$sigma2 # same as the REML estimate
## [1] 0.0004182413
Comment 1 Patrick Perry 2015-03-23 19:20:41 UTC
Created attachment 1760 [details]
Data file
Comment 2 Patrick Perry 2015-03-26 02:50:28 UTC
Here's a simpler test case that reproduces the behavior:

---

set.seed(0)
n <- 5
x <- cumsum(rnorm(n, sd=0.01))


(fit0 <- arima(x, c(0,1,0), method="ML"))
## Call:
## arima(x = x, order = c(0, 1, 0), method = "ML")
##
##
## sigma^2 estimated as 9.164e-05:  log likelihood = 12.92,  aic = -23.84


(fit1 <- arima(x, c(0,1,0), method="ML", xreg=1:n))
## Call:
## arima(x = x, order = c(0, 1, 0), xreg = 1:n, method = "ML")
##
## Coefficients:
##          1:n
##       0.0067
## s.e.  0.0040
##
## sigma^2 estimated as 6.186e-05:  log likelihood = 10.71,  aic = -17.42


fit1$sigma2
## [1] 6.186389e-05

var(diff(x)) # REML
## [1] 6.186389e-05

---

The log likelihood should be higher for fit1, but it is not.  As with the log.GOOG example, it appears that fit1 is using a REML estimate for the innovation variance instead of an ML estimate.

A work-around to this issue is to manually difference the series.  This call gives the correct estimates:

---

# same as fit0:

(fit0.alt <- arima(diff(x), c(0,0,0), include.mean=FALSE, method="ML"))
## Call:
## arima(x = diff(x), order = c(0, 0, 0), include.mean = FALSE, method = "ML")
##
##
## sigma^2 estimated as 9.164e-05:  log likelihood = 12.92,  aic = -23.84

# should be the same as fit1, but isn't; fit1.alt is correct:

(fit1.alt <- arima(diff(x), c(0,0,0), method="ML"))
## Call:
## arima(x = diff(x), order = c(0, 0, 0), method = "ML")
##
## Coefficients:
##       intercept
##          0.0067
## s.e.     0.0034
##
## sigma^2 estimated as 4.64e-05:  log likelihood = 14.28,  aic = -24.56

fit1.alt$sigma2
## [1] 4.639792e-05

var(diff(x)) * (n-2) / (n-1) # ML
## [1] 4.639792e-05
Comment 3 Patrick Perry 2015-03-31 18:32:50 UTC
Even more evidence that there is a bug in the "arima" function: arima0 gives the correct answer in the last example.
---

set.seed(0)
n <- 5
x <- cumsum(rnorm(n, sd=0.01))


arima0(x, c(0,1,0), method="ML", xreg=1:n)
##
## Call:
## arima0(x = x, order = c(0, 1, 0), xreg = 1:n, method = "ML")
##
## Coefficients:
##        xreg1
##       0.0067
## s.e.  0.0034
##
## sigma^2 estimated as 4.64e-05:  log likelihood = 14.28,  aic = -24.56
Comment 4 Patrick Perry 2015-04-02 19:19:02 UTC
Here's a fix for the bug:

In the arima function (from stats/R/arima.R), replace

        if(fit$rank == 0L) {
            ## Degenerate model. Proceed anyway so as not to break old code
            fit <- lm(x ~ xreg - 1, na.action = na.omit)
        }
        n.used <- sum(!is.na(resid(fit))) - length(Delta)

with

        if(fit$rank == 0L) {
            ## Degenerate model. Proceed anyway so as not to break old code
            fit <- lm(x ~ xreg - 1, na.action = na.omit)
            n.used <- sum(!is.na(resid(fit))) - length(Delta)
        } else {
            n.used <- sum(!is.na(resid(fit)))
        }
Comment 5 Patrick Perry 2015-04-02 19:50:42 UTC
By the way, it looks like the bug might have gotten introduced in the patch suggested in Comment 4 from

https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15396
Comment 6 Martin Maechler 2015-06-02 07:59:18 UTC
There has been a thread about this on R-devel, starting here
https://stat.ethz.ch/pipermail/r-devel/2015-May/071206.html

Peter Dalgaard raised important points and had proposed a "more surely" back-compatible approach here,
https://stat.ethz.ch/pipermail/r-devel/2015-May/071211.html,
using two lm() calls, a differenced one for the initial beta coef, and another for counting the non-NA residuals.

whereas Patrick Perry proposed a version where n.used is determined w/o an extra lm() call {apart from the 0-rank case}.

For now, I've prefered a version of PPs last proposal, and committed it for R-devel (svn rev 68424) and R-patched (to become R 3.2.1).