Bug 17366 - [PATCH] Support for missing time points in ar.yw.*()
Summary: [PATCH] Support for missing time points in ar.yw.*()
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist (show other bugs)
Version: R-devel (trunk)
Hardware: All All
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2017-12-15 01:42 UTC by Pavel N. Krivitsky
Modified: 2017-12-27 19:20 UTC (History)
1 user (show)

See Also:


Attachments
A patch against R trunk stats package source to enable ar.yw.*() functions to handle missing rows in x. (9.05 KB, patch)
2017-12-15 01:42 UTC, Pavel N. Krivitsky
Details | Diff
A patch against R trunk stats package source to enable ar.yw.*() functions to handle missing rows in x. (9.14 KB, patch)
2017-12-15 02:41 UTC, Pavel N. Krivitsky
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Pavel N. Krivitsky 2017-12-15 01:42:02 UTC
Created attachment 2308 [details]
A patch against R trunk stats package source to enable ar.yw.*() functions to handle missing rows in x.

# Overview #

At the moment, if the matrix passed to ar.yw() along with na.action=na.pass, both methods (ar.yw.default() and ar.yw.mts()) check it for NA elements and return an error if any are present.

However, since the Yule-Walker code only depends on the data through the autocovariance returned by a call to acf(), and since acf() handles missing data seamlessly, It should be possible to estimate series with missing values in ar(),  especially if the missingness is row-wise (i.e., either all variables in a row are missing or none).

This has some applications beyond the obvious. For example, coda::effectiveSize() given an mcmc.list calls itself with each of the MCMC replications (and indirectly ar()) and then sums up their effective sizes. However, it would be more accurate to do this jointly for all the chains, so a trick like that of https://stats.stackexchange.com/a/23123 could be used to this end.


# Steps to Reproduce #

y <- rnorm(100)
ar(y) # works
y <- c(y, NA) # shouldn't change result
ar(y, na.action=na.pass) # error


# Expected behaviour #

Both examples return the same result.


# Actual behaviour #

The first call works, the second stops with 
Error in ar.yw.default(x, aic = aic, order.max = order.max, na.action = na.action,  : 
  NAs in 'x'


# Patch #

The attached patch against the current R trunk's stats package source modifies ar.yw.default() and ar.yw.mts() in the following ways:
* It replaces a check for any NAs with a check for whether different columns in a given row have different NA status.
* It introduces a new variable, n.obs, to keep track of the number of data rows and replaces n.used where appropriate. (n.used is still used for the number of rows in the input matrix.)

In addition, the patch modifies all ar.*() functions in the following ways:
* It adds an n.obs element to the return list. For ar.yw.*() functions, it may be different, but for the others, it's simply copied from n.used.

In documentation, na.action= argument's description is updated, as is the return value list.
Comment 1 Pavel N. Krivitsky 2017-12-15 02:41:34 UTC
Created attachment 2309 [details]
A patch against R trunk stats package source to enable ar.yw.*() functions to handle missing rows in x.

Forgot to add an na.action= argument to one of the acf() calls.
Comment 2 Martin Maechler 2017-12-21 09:23:37 UTC
I agree that it it conceptually good to think on how to compute ar() estimates in the case of missing values.

However, just dropping cases seems not good enough to me.  The time structure is completely messed up.  If you drop observation 2 and 3 say, then obs. 1 and 4 are treated as if they'd be immediate neighbours ("lag 1")  which they are not: they differ by "lag 3".

Also, if you really want to do such a dangerous if not misleading thing you could simply use

    ar( na.omit(y) , ...)
instead of
    ar( y, ...)

There are many imputation approaches for time series that would be preferable to such a "na.omit" strategy -- even though simple imputations such as "carry last observation forward" (used in applied finance/economy AFAIK) are still very doubtful.

R's  arima()  *does* allow missing values with method "ML" and uses sophisticated state-space modelling to compute a likelihood inspite of the missings.
Maybe we should a sentence about this to the help page, or even change the 
  stop("NAs in 'x'")
message to something like
  stop("NAs in 'x'; consider using  arima(x, method = \"ML\")")

Yes, arima() only works with univariate x  ... but there CRAN packages with sophisticated methodolgy to work for multivariate AR(I)MA modeling.

I propose to close this possibly as "INVALID" (the code works as documented; the  change is encouraging invalid model building).
Comment 3 Pavel N. Krivitsky 2017-12-21 13:19:21 UTC
(In reply to Martin Maechler from comment #2)

> However, just dropping cases seems not good enough to me.  The time
> structure is completely messed up.  If you drop observation 2 and 3 say,
> then obs. 1 and 4 are treated as if they'd be immediate neighbours ("lag 1")
> which they are not: they differ by "lag 3".

Perhaps I didn't explain what the patch does clearly enough. The patch does not at any point drop the cases with the missing data or alter the the time structure.

Rather, it checks whether the missingness is consistent within each row (because otherwise, the sample sizes are not clear), and if it is, it simply passes the NAs on to the acf() call. The acf() call handles NAs seamlessly by not averaging them into the autocovariance estimate, while preserving the time structure. This estimated autocovariance is then passed to the Yule-Walker equation solver.

Thus, if the missingness is noninformative (say, because observations at randomly selected time points are lost), acf() will consistently estimate the autocovariances and ar.yw.*() would consistently estimate the AR parameters.

> I propose to close this possibly as "INVALID" (the code works as documented;
> the  change is encouraging invalid model building).

I believe that the way in which I propose to handle missing values in ar.yw.*() is consistent with the way missing data are handled elsewhere (by lm(), for example).
Comment 4 Martin Maechler 2017-12-22 15:20:26 UTC
(In reply to Pavel N. Krivitsky from comment #3)
> (In reply to Martin Maechler from comment #2)
> 
> > However, just dropping cases seems not good enough to me.  The time
> > structure is completely messed up.  If you drop observation 2 and 3 say,
> > then obs. 1 and 4 are treated as if they'd be immediate neighbours ("lag 1")
> > which they are not: they differ by "lag 3".
> 
> Perhaps I didn't explain what the patch does clearly enough. The patch does
> not at any point drop the cases with the missing data or alter the the time
> structure.
> 
> Rather, it checks whether the missingness is consistent within each row
> (because otherwise, the sample sizes are not clear), and if it is, it simply
> passes the NAs on to the acf() call. The acf() call handles NAs seamlessly
> by not averaging them into the autocovariance estimate, while preserving the
> time structure. This estimated autocovariance is then passed to the
> Yule-Walker equation solver.
> 
> Thus, if the missingness is noninformative (say, because observations at
> randomly selected time points are lost), acf() will consistently estimate
> the autocovariances and ar.yw.*() would consistently estimate the AR
> parameters.
> 
> > I propose to close this possibly as "INVALID" (the code works as documented;
> > the  change is encouraging invalid model building).
> 
> I believe that the way in which I propose to handle missing values in
> ar.yw.*() is consistent with the way missing data are handled elsewhere (by
> lm(), for example).

I see...  and I am sorry for the complete misinterpretation of the code change.
I also was lead to this misinterpretation because I thought that in the multivariate case it looks stupid that you cannot have partially missing observations in a method that build acf() because there should be an analogue of
   cor(*, use = "pairwise.complete.obs").

but then that breaks the warranty for positive semi-definiteness and that may be  important for Yule-Walker (or not).
Comment 5 Martin Maechler 2017-12-27 19:20:20 UTC
committed to R-devel as rev 73953, 4 days ago