Bug 15239 - Add 'anyNA', for efficiently checking for missing values.
Summary: Add 'anyNA', for efficiently checking for missing values.
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Misc
Version: R 2.15.3
Hardware: Other Mac OS X v10.8
Importance: P5 enhancement
Assignee: R-core
Reported: 2013-03-19 23:41 UTC by Tim Hesterberg
Modified: 2013-04-24 16:04 UTC

Description Tim Hesterberg 2013-03-19 23:41:52 UTC
I've created an efficient anyNA function, and use it extensively.
I would like to see this (or a better version) incorporated in R.

The standard way to check if an object has any missing values is
  any(is.na(x))
which creates an object the same size as x, before doing any checking.
For a data frame, it creates the logical vectors for all columns
and cbind's them together before checking if any are TRUE.
This is slow.

My implementation uses .Call, and handles lists (including data frames)
by checking one component at a time.
This is dramatically faster.
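
To illustrate the difference, here is a minimal standalone sketch in C. It is not the code from src/anyNA.c and does not use the R API. It assumes plain double arrays, uses isnan() as a stand-in for the NA test, and the names any_nan_materialized, any_nan, any_nan_columns and struct column are hypothetical, chosen only for this example. The first function mimics any(is.na(x)) by building a full-length temporary before checking it. The second stops at the first hit and allocates nothing. The third applies the same early exit one component at a time.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Materializing approach, analogous to any(is.na(x)): build a flag
 * array as long as x, then scan it. Returns 1 if a NaN is present,
 * 0 if not, and -1 if the temporary could not be allocated. */
int any_nan_materialized(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    if (n == 0)
        return 0;
    unsigned char *flags = malloc(n);
    if (flags == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        flags[i] = isnan(x[i]) ? 1 : 0;
    int found = 0;
    for (size_t i = 0; i < n; i++) {
        if (flags[i]) {
            found = 1;
            break;
        }
    }
    free(flags);
    return found;
}

/* Short-circuit approach: no temporary, return at the first NaN.
 * An empty input yields 0. */
int any_nan(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (isnan(x[i]))
            return 1;
    }
    return 0;
}

/* Hypothetical stand-in for one component of a list or data frame. */
struct column {
    const double *values;
    size_t length;
};

/* Check components one at a time; later components are never
 * touched once a NaN has been found. */
int any_nan_columns(const struct column *cols, size_t ncols)
{
    assert(cols != NULL || ncols == 0);
    for (size_t j = 0; j < ncols; j++) {
        if (any_nan(cols[j].values, cols[j].length))
            return 1;
    }
    return 0;
}
```

When the first element is missing, any_nan() returns after a single comparison, while the materializing version still reads every element and allocates n bytes.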

The implementation isn't perfect. It isn't a generic function,
and for S4 objects I revert to calling any(is.na(x)).
.Internal might be better than .Call.

The code is in 
  http://www.timhesterberg.net/r-packages/aggregate_1.2.4.1.tar.gz
in
  R/anyNA.R
  man/anyNA.Rd
  src/anyNA.c
  tests/anyNA.t
  inst/doc/benchmark_anyNA.R
Comment 1 Martin Maechler 2013-03-21 09:17:50 UTC
I had talked about this with Tim, @ useR 2012, and promised some action.
I'm sorry this is too late for 3.0.0, but it is definitely something I support
and will "sponsor" if no one else beats me to it.

TH should be aware that the challenge is a bit greater now, given the "long vectors" that we'd also want to support.

Martin
Comment 2 Martin Maechler 2013-04-19 18:58:11 UTC
This is now committed, using a new .Primitive  anyNA()  and some testing.
My current plan is to port it to R 3.0.0 patched after some waiting period.

We (R core) have to consider if the Copyright (copied from Tim's original source)
is not problematic to us.

Martin
Comment 3 Henrik Bengtsson 2013-04-22 23:52:17 UTC
Hi,

I noticed from R NEWS that anyNA() is being added to R 3.0.0 patched.  May I suggest using the name anyMissing() instead of anyNA()?

The reason is that this was discussed a while ago in thread '[Rd] Suggestions to speed up median() and has.na()' on April 11, 2006:

  https://stat.ethz.ch/pipermail/r-devel/2006-April/037209.html

The name originates from what S-Plus calls it and (in that thread) a decision was made at the time (and backed by Duncan Murdoch and Thomas Lumley).  

Rather soon after, anyMissing() was added to Biobase (on Bioconductor) and to matrixStats (on CRAN).  Many other Bioconductor packages have since used anyMissing() to mean exactly "any(is.na(x))" for various other data types/classes [it's a generic].  In other words, using anyMissing() is more or less a de facto convention.

/Henrik Bengtsson

PS. This thread is marked as 'CLOSED' so I'm not sure whether you'll get a notification or not.  If I don't hear back soon, I'll send an email instead.
Comment 4 Martin Maechler 2013-04-23 10:35:47 UTC
(In reply to comment #3)

You have a very good point; and indeed, I had forgotten about that thread of seven years ago.   Hence reopened.

I propose to indeed effectuate the  s/anyNA/anyMissing/   change.
Martin
Comment 5 Martin Maechler 2013-04-24 08:34:57 UTC
(In reply to comment #4)
> You have a very good point; and indeed, I had forgotten about that thread of
> seven years ago.   Hence reopened.
> 
> I propose to indeed effectuate the  s/anyNA/anyMissing/   change.
> Martin

and so I did.
The plan is to port this to R 3.0.x after a waiting period.
In R-devel (3.1.0 to be), we now make extensive use of anyMissing();
this was a good idea, as it revealed that the case x=NULL gave an error in Tim's original C code.  It now gives FALSE.

Martin
Comment 6 Tim Hesterberg 2013-04-24 15:54:05 UTC
I'll argue for the name 'anyNA' instead of 'anyMissing'.

I'm the guilty party. I'm responsible for the name 'anyMissing', when I wrote the original function as part of the S+MissingData package. At that time I didn't give the name much thought - we were doing a "missing data" package, so I called it 'anyMissing'. But "missing" is ambiguous - it could refer to missing values, or to missing arguments (like the 'missing' function).

So when I created a new function for R, I reconsidered, and decided that 'anyNA' is a better choice.

My first choice was 'any.na' (like 'is.na') but that gets interpreted as an S3 method for 'any' (by R CMD check, and perhaps also in dispatching).

If you do use 'anyNA', or fix the problems with 'any.na' and use that name, then Biobase and Bioconductor could deprecate the old name and switch over time to the new name.

It would be better to switch now, before anyMissing is included in an R release.
Comment 7 Tim Hesterberg 2013-04-24 16:04:56 UTC
About the copyright - I was authorized to release the aggregate package (which includes anyNA) as open source, and did submit aggregate to CRAN (it was rejected because it uses .Internal calls).
If there is a problem with the copyright please let me know.

(In reply to comment #2)
> We (R core) have to consider if the Copyright (copied from Tim's original
> source)
> is not problematic to us.
> 
> Martin