Bug 15396 - Initialization of regression coefficients in arima()
Initialization of regression coefficients in arima()
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Analyses
R 3.0.1
x86_64/x64/amd64 (64-bit) Linux
: P5 minor
Assigned To: R-core
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2013-07-18 01:41 UTC by Rob Hyndman
Modified: 2013-07-31 09:44 UTC (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Rob Hyndman 2013-07-18 01:41:27 UTC
There is an occasional problem with arima() when an xreg argument is used, due to poor initialization of the regression coefficients. The following example demonstrates the problem.

### Example

xreg <- structure(c(11.8650181597559, 11.8650181597559, 11.8870634059787, 
                      11.8870634059787, 11.8870634059787, 11.915269485325, 11.915269485325, 
                      11.915269485325, 11.8947445293104, 11.8947445293104, 11.8947445293104, 
                      11.915018895356, 11.915018895356, 11.915018895356, 11.9352188467266, 
                      11.9352188467266, 11.9352188467266, 11.9722546158146, 11.9722546158146, 
                      11.9722546158146, 11.9535455536985, 11.9535455536985, 11.9535455536985, 
                      11.9738078300087, 11.9738078300087, 11.9738078300087, 12.0003209339106, 
                      12.0003209339106, 12.0003209339106, 12.0375218352713, 12.0375218352713, 
                      12.0375218352713, 12.010324352181, 12.010324352181, 12.010324352181, 
                      12.0341345930829, 12.0341345930829, 12.0341345930829, 12.0614410154489, 
                      12.0614410154489, 12.0614410154489, 12.0981922910769, 12.0981922910769, 
                      12.0981922910769, 12.0618013094381, 12.0618013094381, 12.0618013094381, 
                      12.0783462758711, 12.0783462758711, 12.0783462758711, 12.1019665279712, 
                      12.1019665279712, 12.1019665279712, 12.1399987240463, 12.1399987240463, 
                      12.1399987240463, 12.1162921653836, 12.1162921653836, 12.1162921653836, 
                      12.1365142570258, 12.1365142570258, 12.1365142570258, 12.1630134280697, 
                      12.1630134280697, 12.1630134280697, 12.1964441062299, 12.1964441062299, 
                      12.1964441062299, 12.182141909305, 12.182141909305, 12.182141909305, 
                      12.1990221826593, 12.1990221826593, 12.1990221826593, 12.2262020168253, 
                      12.2262020168253, 12.2262020168253, 12.2593410214927, 12.2593410214927, 
                      12.2593410214927, 12.2450758670852, 12.2450758670852, 12.2450758670852, 
                      12.2599972250487, 12.2599972250487, 12.2599972250487, 12.2878172887049, 
                      12.2878172887049, 12.2878172887049, 12.3190750871493, 12.3190750871493, 
                      12.3190750871493, 12.3044209328813, 12.3044209328813, 12.3044209328813
), .Tsp = c(2005.08333333333, 2012.91666666667, 12), class = "ts")

x <- structure(c(19.9456084349343, 20.0118531543317, 19.9694121405293, 
            20.0269403445, 19.9665297814722, 19.9886518310909, 19.9843740634432, 
            19.9854850910087, 20.0392687667526, 19.9881135983641, 19.9728880162677, 
            19.9368169683714, 19.8638433435654, 19.9536088578171, 19.9120953588195, 
            19.9822707031469, 19.9528310456123, 20.0249599010602, 20.0404218956532, 
            20.0723784804143, 20.1360801238366, 20.0950654657256, 20.0897086375468, 
            20.086784260154, 20.0222370556615, 20.214468643476, 20.1548155785683, 
            20.1934332216383, 20.12957310507, 20.1870221415883, 20.1983550073117, 
            20.1816695974089, 20.2179157512415, 20.1849731691413, 20.1720476816048, 
            20.1805286128489, 20.099752342315, 20.2043879626502, 20.0720847970307, 
            20.1037077831874, 20.0617733341758, 20.1091107669282, 20.0990322063309, 
            20.0787723459818, 20.0825957734847, 20.0332210677286, 20.0251887218254, 
            20.0631861464734, 20.0045791603852, 20.0901479216967, 20.1860107900427, 
            20.2267021852532, 20.1995397655705, 20.2350271660085, 20.2516836220321, 
            20.2966895152579, 20.2866488641703, 20.2281625356505, 20.2576557823896, 
            20.2482302869357, 20.194397933545, 20.2740461387681, 20.2641027560996, 
            20.340740132862, 20.259726077587, 20.3158067518916, 20.3103720523742, 
            20.3632800987466, 20.3596149579674, 20.3423179573429, 20.3879310672488, 
            20.5610072482633, 20.4870208589472, 20.56862117258, 20.565487980054, 
            20.6266099106437, 20.5939724361097, 20.6529112690121, 20.6449391264395, 
            20.6203997717969, 20.6625579988058, 20.6166711113346, 20.6483244516988, 
            20.6491414520427, 20.5162838288319, 20.6627867521469, 20.6440973568406, 
            20.7046055969827, 20.6915307682507, 20.7475669969293, 20.7705138575836, 
            20.7080794204999, 20.7655121472943, 20.7582940008166, 20.8056828380058
), .Tsp = c(2005.08333333333, 2012.91666666667, 12), class = "ts")

arima(x, xreg=xreg, order=c(1,1,1), seasonal=c(1,0,1))
arima(diff(x), xreg=diff(xreg), order=c(1,0,1), seasonal=c(1,0,1), include.mean=FALSE)

### End of example

These two models should be essentially the same, but they aren't. The problem arises because the initialization of the regression coefficient uses lm on the undifferenced data, and the estimation is not consistent when the residuals are non-stationary. Sometimes it doesn't matter as the optimizer finds the global optimum in any case, but sometimes the very poor initial estimates lead to a local optimum.

There is a relatively easy fix. Simply replace line 195 in arima.R with the following:

### Proposed change to line 195 of arima.R

dx <- x
dxreg <- xreg
if(order[2L] > 0) {
  dx <- diff(dx, 1, order[2L])
  dxreg <- diff(dxreg, 1, order[2L])
}
if(seasonal$period > 1 & seasonal$order[2L] > 0) {
  dx <- diff(dx, seasonal$period, seasonal$order[2L])
  dxreg <- diff(dxreg, seasonal$period, seasonal$order[2L])
}
fit <- lm(dx ~ dxreg - 1, na.action = na.omit)

### End of proposed change

Then the lm regression is applied to the differenced data, so the residuals are stationary (assuming the specified model is ok), giving a consistent estimate, and so the optimizer has a better chance of working.
Comment 1 Brian Ripley 2013-07-22 09:37:33 UTC
Fixed.

BTW, using attachments avoids problems with HTML.
Comment 2 Brian Ripley 2013-07-22 12:15:27 UTC
Unfortunately the revised version fails in caschrono:

Errors in running code in vignettes:
when running code in ‘Anx5.Rnw’
  ...
> x1 = as.matrix(seq(1, length(y2)))

> x2 = as.matrix(seq(1, length(y2))^2)

> (m1 = Arima(y2, order = c(1, 1, 0), seasonal = list(order = c(0, 
+     1, 1), period = 12), xreg = x1))

  When sourcing ‘Anx5.R’:
Error: non-finite value supplied by optim

Please take a look. (I have backed this out of R-patched.)
Comment 3 Rob Hyndman 2013-07-23 05:49:30 UTC
OK. That is being caused by a degenerate model being fitted (where the standard error of the coefficient will always be infinite). The author is attempting to fit a model with a time trend, but including both first and seasonal differencing, which makes the time trend redundant.

Even before my proposed fix, the ARIMA model was returning NaN for the trend coefficient. With the proposed fix, the problem is simply being captured earlier.

It *should* be returned as an error, or at least as an estimate with infinite variance.

Let me think about how to handle that more gracefully and get back to you.


(In reply to comment #2)
> Unfortunately the revised version fails in caschrono:
> 
> Errors in running code in vignettes:
> when running code in ‘Anx5.Rnw’
>   ...
> > x1 = as.matrix(seq(1, length(y2)))
> 
> > x2 = as.matrix(seq(1, length(y2))^2)
> 
> > (m1 = Arima(y2, order = c(1, 1, 0), seasonal = list(order = c(0, 
> +     1, 1), period = 12), xreg = x1))
> 
>   When sourcing ‘Anx5.R’:
> Error: non-finite value supplied by optim
> 
> Please take a look. (I have backed this out of R-patched.)
Comment 4 Rob Hyndman 2013-07-31 06:11:10 UTC
Here is an alternative that will proceed with the existing approach if the model on differenced data is degenerate. 

Replace line 195 of arima.R with the following:

dx <- x
dxreg <- xreg
if(order[2L] > 0) {
  dx <- diff(dx, 1, order[2L])
  dxreg <- diff(dxreg, 1, order[2L])
}
if(seasonal$period > 1 & seasonal$order[2L] > 0) {
  x <- diff(dx, seasonal$period, seasonal$order[2L])
  dxreg <- diff(dxreg, seasonal$period, seasonal$order[2L])
}
if(length(dx) > ncol(dxreg))
  fit <- lm(dx ~ dxreg - 1, na.action = na.omit)
else
  fit <- list(rank=0)
if(fit$rank==0) {
  # Degenerate model. Proceed anyway so as not to break old code
  fit <- lm(x ~ xreg - 1, na.action = na.omit)
}


(In reply to comment #2)
> Unfortunately the revised version fails in caschrono:
> 
> Errors in running code in vignettes:
> when running code in ‘Anx5.Rnw’
>   ...
> > x1 = as.matrix(seq(1, length(y2)))
> 
> > x2 = as.matrix(seq(1, length(y2))^2)
> 
> > (m1 = Arima(y2, order = c(1, 1, 0), seasonal = list(order = c(0, 
> +     1, 1), period = 12), xreg = x1))
> 
>   When sourcing ‘Anx5.R’:
> Error: non-finite value supplied by optim
> 
> Please take a look. (I have backed this out of R-patched.)
Comment 5 Brian Ripley 2013-07-31 09:44:31 UTC
Thanks, incorporated now and ported to R-patched.