Created attachment 1759 [details] R code to reproduce behavior I have found (with Rob Hyndman's help) what I believe to be a bug in the "arima" function. I am using this function to fit a series of ARIMA(p,1,q) models, both with and without drift, and then using the corrected AIC to select p, q, and whether to include a drift. Nested models should never have higher log-likelihoods. However, the log-likelihood values computed by the "arima" function do not have this property. This can lead to a sub-optimal model being selected. I think the problem is that the arima function is using a REML estimate for the variance parameter, even when method = "ML" is specified. See the example below. I have also attached the source code to reproduce this behavior. -- # read in a time series (log of Google daily closing price from May 2, 2005 # to December 29, 2006): log.GOOG <- log(ts(read.csv("http://ptrckprry.com/course/forecasting/data/google.csv")[,1])) time <- seq_along(log.GOOG) # Fit two models, one with drift (fit1), and one without (fit0). # The log-likelihood for the model with drift (fit1) should be higher, but it is not. (fit0 <- arima(log.GOOG, c(0, 1, 0), method="ML")) ## ## Call: ## arima(x = log.GOOG, order = c(0, 1, 0), method = "ML") ## ## ## sigma^2 estimated as 0.0004203: log likelihood = 1036.72, aic = -2071.45 (fit1 <- arima(log.GOOG, c(0, 1, 0), xreg=time, method="ML")) ## ## Call: ## arima(x = log.GOOG, order = c(0, 1, 0), xreg = time, method = "ML") ## ## Coefficients: ## time ## 0.0017 ## s.e. 0.0010 ## sigma^2 estimated as 0.0004182: log likelihood = 1035.76, aic = -2067.52 # This is not a problem with the optimization routine: drift <- seq(-0.01, 0.01, length=200) loglik <- sapply(drift, function(d) arima(log.GOOG, c(0, 1, 0), xreg=time, method="ML", fixed=d)$loglik) plot(drift, loglik, type="l") abline(h=fit1$loglik) # I think the problem is that the arima function is using a REML estimate for # the variance: n.used <- length(diff(log.GOOG)) var(diff(log.GOOG)) # REML ## [1] 0.0004182413 var(diff(log.GOOG)) * (n.used - 1) / n.used # ML ## [1] 0.0004172454 fit1$sigma2 # same as the REML estimate ## [1] 0.0004182413

Created attachment 1760 [details] Data file

Here's a simpler test case that reproduces the behavior: --- set.seed(0) n <- 5 x <- cumsum(rnorm(n, sd=0.01)) (fit0 <- arima(x, c(0,1,0), method="ML")) ## Call: ## arima(x = x, order = c(0, 1, 0), method = "ML") ## ## ## sigma^2 estimated as 9.164e-05: log likelihood = 12.92, aic = -23.84 (fit1 <- arima(x, c(0,1,0), method="ML", xreg=1:n)) ## Call: ## arima(x = x, order = c(0, 1, 0), xreg = 1:n, method = "ML") ## ## Coefficients: ## 1:n ## 0.0067 ## s.e. 0.0040 ## ## sigma^2 estimated as 6.186e-05: log likelihood = 10.71, aic = -17.42 fit1$sigma2 ## [1] 6.186389e-05 var(diff(x)) # REML ## [1] 6.186389e-05 --- The log likelihood should be higher for fit1, but it is not. As with the log.GOOG example, it appears that fit1 is using a REML estimate for the innovation variance instead of an ML estimate. A work-around to this issue is to manually difference the series. This call gives the correct estimates: --- # same as fit0: (fit0.alt <- arima(diff(x), c(0,0,0), include.mean=FALSE, method="ML")) ## Call: ## arima(x = diff(x), order = c(0, 0, 0), include.mean = FALSE, method = "ML") ## ## ## sigma^2 estimated as 9.164e-05: log likelihood = 12.92, aic = -23.84 # should be the same as fit1, but isn't; fit1.alt is correct: (fit1.alt <- arima(diff(x), c(0,0,0), method="ML")) ## Call: ## arima(x = diff(x), order = c(0, 0, 0), method = "ML") ## ## Coefficients: ## intercept ## 0.0067 ## s.e. 0.0034 ## ## sigma^2 estimated as 4.64e-05: log likelihood = 14.28, aic = -24.56 fit1.alt$sigma2 ## [1] 4.639792e-05 var(diff(x)) * (n-2) / (n-1) # ML ## [1] 4.639792e-05

Even more evidence that there is a bug in the "arima" function: arima0 gives the correct answer in the last example. --- set.seed(0) n <- 5 x <- cumsum(rnorm(n, sd=0.01)) arima0(x, c(0,1,0), method="ML", xreg=1:n) ## ## Call: ## arima0(x = x, order = c(0, 1, 0), xreg = 1:n, method = "ML") ## ## Coefficients: ## xreg1 ## 0.0067 ## s.e. 0.0034 ## ## sigma^2 estimated as 4.64e-05: log likelihood = 14.28, aic = -24.56

Here's a fix for the bug: In the arima function (from stats/R/arima.R), replace if(fit$rank == 0L) { ## Degenerate model. Proceed anyway so as not to break old code fit <- lm(x ~ xreg - 1, na.action = na.omit) } n.used <- sum(!is.na(resid(fit))) - length(Delta) with if(fit$rank == 0L) { ## Degenerate model. Proceed anyway so as not to break old code fit <- lm(x ~ xreg - 1, na.action = na.omit) n.used <- sum(!is.na(resid(fit))) - length(Delta) } else { n.used <- sum(!is.na(resid(fit))) }

By the way, it looks like the bug might have gotten introduced in the patch suggested in Comment 4 from https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15396

There has been a thread about this on R-devel, starting here https://stat.ethz.ch/pipermail/r-devel/2015-May/071206.html Peter Dalgaard raised important points and had proposed a "more surely" back-compatible approach here, https://stat.ethz.ch/pipermail/r-devel/2015-May/071211.html, using two lm() calls, a differenced one for the initial beta coef, and another for counting the non-NA residuals. whereas Patrick Perry proposed a version where n.used is determined w/o an extra lm() call {apart from the 0-rank case}. For now, I've prefered a version of PPs last proposal, and committed it for R-devel (svn rev 68424) and R-patched (to become R 3.2.1).