Bug 14488 - stats::cor with method="spearman" and use="complete.obs" handles NAs incorrectly
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Misc
Version: R 2.12.0
Hardware: Other Linux
Importance: P3 normal
Assigned To: R-core
Reported: 2011-01-27 12:58 UTC by Simon Anders
Modified: 2014-02-16 11:43 UTC (History)
Description Simon Anders 2011-01-27 12:58:19 UTC
If I understand the help page for 'cor' correctly, the two modes 'complete.obs' 
and 'pairwise.complete.obs' differ only in how missing values are handled 
when calculating a correlation _matrix_. When calculating a 
single (scalar) correlation coefficient for two data vectors x and y, 
both should give the same result.

For Pearson correlation, this is in fact the case:

> x <- runif( 10 )
> y <- runif( 10 )
> y[5] <- NA

> cor( x, y, use="complete.obs" )
[1] 0.407858
> cor( x, y, use="pairwise.complete.obs" )
[1] 0.407858

For Spearman correlation, we do NOT get the same results:

> cor( x, y, method="spearman", use="complete.obs" )
[1] 0.3416009
> cor( x, y, method="spearman", use="pairwise.complete.obs" )
[1] 0.3333333

To see the likely reason for this possible bug, observe how we can recreate both results by swapping the order of the operations 'transforming to ranks' and 'removing missing observations':

> goodobs <- !is.na(x) & !is.na(y)

> cor( rank(x)[goodobs], rank(y)[goodobs] )
[1] 0.3416009
> cor( rank(x[goodobs]), rank(y[goodobs]) )
[1] 0.3333333
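
Because the vectors above come from runif() without a seed, the exact numbers cannot be reproduced. The following is a minimal deterministic sketch of the same effect; the data values and the variable names rank_then_drop and drop_then_rank are made up for illustration and are not part of the original report.

```r
# Hypothetical data: five points, one missing value in y.
x <- c(1, 2, 3, 4, 5)
y <- c(1, 3, NA, 2, 4)

goodobs <- !is.na(x) & !is.na(y)

# Rank first, then drop incomplete pairs: x keeps ranks 1, 2, 4, 5.
rank_then_drop <- cor(rank(x)[goodobs], rank(y)[goodobs])

# Drop incomplete pairs first, then rank: x gets ranks 1, 2, 3, 4.
drop_then_rank <- cor(rank(x[goodobs]), rank(y[goodobs]))

print(rank_then_drop)  # 1/sqrt(2), about 0.707
print(drop_then_rank)  # 0.8

# For comparison: the built-in Spearman correlation on the same data.
print(cor(x, y, method = "spearman", use = "pairwise.complete.obs"))
```

With these values the two orders of operation disagree (about 0.707 versus 0.8), and only the second agrees with the textbook formula 1 - 6*sum(d^2)/(n*(n^2-1)) applied to the four complete pairs.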

I would claim that only the calculation resulting in 0.3333 is a proper 
Spearman correlation, while the line resulting in 0.3416 is not. After 
all, the following is not a complete set of ranks: the 9 remaining 
observations carry ranks from 1 to 10, with rank 3 skipped:

> rank(x)[goodobs]
[1] 10  6  8  7  4  5  1  9  2
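
The "complete set of ranks" argument can be checked mechanically. This is only a sketch: the x values below are invented so that their ranks match the output shown above, the helper is_complete_ranks is a hypothetical name, and the check assumes there are no ties.

```r
# Hypothetical x whose ranks are 10 6 8 7 3 4 5 1 9 2; observation 5 is the
# one whose partner in y is missing.
x <- c(10, 6, 8, 7, 3, 4, 5, 1, 9, 2) / 10
goodobs <- rep(TRUE, 10)
goodobs[5] <- FALSE

# A tie-free rank vector of length n should be a permutation of 1..n.
is_complete_ranks <- function(r) all(sort(r) == seq_along(r))

stale <- rank(x)[goodobs]   # ranked on all 10 values, then subset
fresh <- rank(x[goodobs])   # subset to 9 values, then ranked

print(stale)
print(is_complete_ranks(stale))  # FALSE: rank 3 is missing, rank 10 present
print(fresh)
print(is_complete_ranks(fresh))  # TRUE
```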


[Note: I've reported this bug already on the R-devel mailing list but did not get any response there.]
Comment 1 Simon Anders 2011-01-27 13:00:16 UTC
Forgot the sessionInfo; here it is:

> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
  [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
  [7] LC_PAPER=en_US.utf8       LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pspearman_0.2-5 SuppDists_1.1-8

loaded via a namespace (and not attached):
[1] tools_2.12.0
Comment 2 Brian Ripley 2011-03-21 07:42:51 UTC
changed in 2.13.0
Comment 3 Jackie Rosen 2014-02-16 11:43:13 UTC
(spam comment removed)