Bug 14207 - Incorrect Kendall's tau for ordered variables
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Analyses
Version: old
Hardware: All Linux
Importance: P5 normal
Assigned To: Jitterbug compatibility account
 
Reported: 2010-02-08 10:42 UTC by Jitterbug compatibility account
Modified: 2010-02-08 23:09 UTC (History)

Description Jitterbug compatibility account 2010-02-08 10:42:59 UTC
From: msa@biostat.mgh.harvard.edu
Full_Name: Marek Ancukiewicz
Version: 2.10.1
OS: Linux
Submission from: (NULL) (74.0.49.2)


Both cor() and cor.test() handle ordered variables incorrectly with
method="kendall", and cor() also handles them incorrectly with
method="spearman" (method="pearson" always works correctly, while
method="spearman" works for cor.test(), but not for cor()).

In erroneous calculations these functions ignore the inherent ordering
of the ordered variable (e.g., '9'<'10'<'11') and instead seem to assume 
an alphabetic ordering ('10'<'11'<'9'). 

> cor(9:11,1:3,method="k")
[1] 1
> cor(as.ordered(9:11),1:3,method="k")
[1] -0.3333333
> cor.test(as.ordered(9:11),1:3,method="k")

	Kendall's rank correlation tau

data:  as.ordered(9:11) and 1:3 
T = 1, p-value = 1
alternative hypothesis: true tau is not equal to 0 
sample estimates:
       tau 
-0.3333333 

> cor(9:11,1:3,method="s")
[1] 1
> cor(as.ordered(9:11),1:3,method="s")
[1] -0.5
> cor.test(as.ordered(9:11),1:3,method="s")

	Spearman's rank correlation rho

data:  as.ordered(9:11) and 1:3 
S = 0, p-value = 0.3333
alternative hypothesis: true rho is not equal to 0 
sample estimates:
rho 
  1

Comment 1 Jitterbug compatibility account 2010-02-08 19:23:08 UTC
From: Peter Dalgaard <P.Dalgaard@biostat.ku.dk>
msa@biostat.mgh.harvard.edu wrote:
> Both cor() and cor.test() handle ordered variables incorrectly with
> method="kendall", and cor() also handles them incorrectly with
> method="spearman" (method="pearson" always works correctly, while
> method="spearman" works for cor.test(), but not for cor()).
> 
> In erroneous calculations these functions ignore the inherent ordering
> of the ordered variable (e.g., '9'<'10'<'11') and instead seem to assume 
> an alphabetic ordering ('10'<'11'<'9'). 

Strictly speaking, not a bug, since the documentation for cor() has

       x: a numeric vector, matrix or data frame.

and that for cor.test() has

    x, y: numeric vectors of data values.  ‘x’ and ‘y’ must have the
          same length.

so no one ever claimed that class "ordered" variables should work.

However, the root cause is that as.vector on a factor variable (ordered
or not) converts it to a character vector, hence

> rank(as.vector(as.ordered(9:11)))
[1] 3 1 2

Looks like a simple fix would be to use as.vector(x, "numeric") inside
the definition of cor().
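
The root cause described above can be reproduced in a few lines. This is only a sketch: it uses as.integer() on the factor as a stand-in for a numeric coercion, and the variable names (f, labels, codes) are made up for illustration.

```r
f <- as.ordered(9:11)      # levels "9" < "10" < "11"

# as.vector() on a factor returns the labels as character strings,
# so rank() compares them as text and "9" sorts last.
labels <- as.vector(f)
rank(labels)               # 3 1 2

# The integer codes follow the declared level order instead.
codes <- as.integer(f)
rank(codes)                # 1 2 3
```

Ranking the codes rather than the labels is what restores the '9'<'10'<'11' ordering in this example.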





Comment 2 Jitterbug compatibility account 2010-02-08 20:07:01 UTC
From: Marek Ancukiewicz <msa@biostat.mgh.harvard.edu>

Dear Peter,

Thank you. Although the documentation does mention numeric variables,
one would intuitively expect cor() and cor.test() to work for ordered
factors with methods "kendall" and "spearman". After all, these are
nonparametric procedures, defined for ordinal scales, and the only
information they need is the ranks (the same should be true for
wilcox.test()).

So even if this is, strictly speaking, not a bug, I would strongly
suggest extending cor() and cor.test() to work with ordered factors for
Kendall's and Spearman's correlations (although this would not make much
sense for Pearson's correlation). It looks like the change should be
very easy.

Marek Ancukiewicz

Comment 3 Jitterbug compatibility account 2010-02-08 23:09:10 UTC
From: Prof Brian Ripley <ripley@stats.ox.ac.uk>
On Mon, 8 Feb 2010, Peter Dalgaard wrote:

> Looks like a simple fix would be to use as.vector(x, "numeric") inside
> the definition of cor().

A fix for that particular case: the problem is that it relies on the
underlying representation.  I think a better fix would be to do either
of

- test for numeric and throw an error otherwise, or
- use xtfrm, which has the advantage of being more general and
   allowing methods to be written (S3 or S4 methods in R-devel).
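
The two options above might look roughly like this from the caller's side. rank_cor() and its strict argument are hypothetical names invented for this sketch; it is not the code that went into R, only an illustration of checking is.numeric() versus mapping through xtfrm().

```r
# Hypothetical helper, not part of R: illustrates the two options only.
rank_cor <- function(x, y, method = c("kendall", "spearman"), strict = FALSE) {
  method <- match.arg(method)
  if (strict) {
    # Option 1: refuse anything that is not numeric.
    if (!is.numeric(x) || !is.numeric(y))
      stop("'x' and 'y' must be numeric")
  } else {
    # Option 2: map each input to a numeric vector that sorts the same way.
    x <- xtfrm(x)
    y <- xtfrm(y)
  }
  cor(x, y, method = method)
}

f <- as.ordered(9:11)
rank_cor(f, 1:3, "kendall")    # 1
rank_cor(f, 1:3, "spearman")   # 1
```

With strict = TRUE the same call stops with an error instead of returning a value, which corresponds to the first option.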


Comment 4 Jitterbug compatibility account 2010-02-14 20:05:00 UTC
NOTES:
 Tests added for misuse in 2.11.0
(Note that the suggested change would still allow nonsense results for unordered
factors.)
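
To see why unordered factors remain a problem, note that their integer codes simply follow the sorted level labels. The guard below, ordered_codes(), is a hypothetical helper written for this sketch and is not something the report or R itself provides.

```r
# Hypothetical guard, not part of R.
ordered_codes <- function(x) {
  if (is.factor(x) && !is.ordered(x))
    stop("unordered factor: level order carries no ranking")
  xtfrm(x)
}

g <- factor(c("low", "mid", "high"))   # unordered; levels are sorted as text
as.integer(g)                           # 2 3 1, i.e. "high" gets the lowest code

h <- factor(c("low", "mid", "high"),
            levels = c("low", "mid", "high"), ordered = TRUE)
ordered_codes(h)                        # 1 2 3
```

Calling ordered_codes(g) stops with an error, whereas a bare numeric coercion of g would silently feed the codes 2 3 1 into a rank correlation.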
Comment 5 Jitterbug compatibility account 2010-02-14 20:05:59 UTC
Audit (from Jitterbug):
Sun Feb 14 14:05:59 2010	ripley	changed notes
Sun Feb 14 14:05:59 2010	ripley	moved from incoming to Analyses-fixed