Bug 14397 - cmdscale may not remove the zero eigenvalue but largest negative one
cmdscale may not remove the zero eigenvalue but largest negative one
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Analyses
R 2.11.1
All All
: P5 normal
Assigned To: R-core
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-10-04 12:14 UTC by Jari Oksanen
Modified: 2010-10-17 06:52 UTC (History)
0 users

See Also:


Attachments
diff file of fixed cmdscale: remove zero eigenvalue instead of the last one (1.17 KB, patch)
2010-10-04 12:14 UTC, Jari Oksanen
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jari Oksanen 2010-10-04 12:14:24 UTC
Created attachment 1130 [details]
diff file of fixed cmdscale: remove zero eigenvalue instead of the last one

One of the eigenvalues will be zero in metric scaling performed by cmdscale(). Function cmdscale() assumes that the last eigenvalue is zero and always drops that one. However, with non-Euclidean distances/dissimilarities, some of the eigenvalues are negative and then the zero is not the last one, but the last eigenvalue will be the largest negative one. Here is an example with the 'eurodist' data (used in example(cmdscale)):

# original code: last eigenvalue removed
> cmdscale(eurodist, k=20, eig=TRUE)$eig
 [1]  1.953838e+07  1.185656e+07  1.528844e+06  1.118742e+06  7.893472e+05
 [6]  5.816552e+05  2.623192e+05  1.925976e+05  1.450845e+05  1.079673e+05
[11]  5.139484e+04 -1.862645e-09 -9.496124e+03 -5.305820e+04 -1.322166e+05
[16] -2.573360e+05 -3.326719e+05 -5.162523e+05 -9.191491e+05 -1.006504e+06
 
Here the zero eigenvalue is number 12 of magnitude -2e-9. The eigenvalues on both sides are 5e4 and -9e3 which are orders of magnitude higher. The last reported eigenvalue is -1e6. If we remove the zero eigenvalue (numbere 12), the code will return:

# fixed code: zero eigenvalue removed
> cmdscale(eurodist, k=20, eig=TRUE)$eig
 [1] 19538377.090 11856555.334  1528844.468  1118741.951   789347.203
 [6]   581655.207   262319.208   192597.562   145084.535   107967.307
[11]    51394.841    -9496.124   -53058.196  -132216.575  -257336.026
[16]  -332671.901  -516252.254  -919149.098 -1006503.960 -2251844.332

The removal of zero eigenvalue and keeping the last eigenvalue will influence the goodness-of-fit statistic (GOF item in result).

There are other problems with the GOF statistic, but I sine I already had an item on those in R-devel, I don't want to return to that issue. My report 

https://stat.ethz.ch/pipermail/r-devel/2009-June/053729.html

I attach a diff that seems to fix the problem.
Comment 1 Brian Ripley 2010-10-15 12:49:57 UTC
Correct definition (according to Cox & Cox, Marriott, Mardia ....) is now used (2.12.0 patched) when there are negative eigenvalues.
Comment 2 Jari Oksanen 2010-10-15 13:04:02 UTC
With revision 53321 for every k too high (in this case k > 11):

> m1 <- cmdscale(eurodist, eig=T, k=12)
Error in evec %*% diag(sqrt(ev), k) : non-conformable arguments
In addition: Warning messages:
1: In cmdscale(eurodist, eig = T, k = 12) :
  some of the first 12 eigenvalues are < 0
2: In y[1L + 0L:(m - 1L) * (n + 1L)] <- x :
  number of items to replace is not a multiple of replacement length
Comment 3 Jari Oksanen 2010-10-15 13:42:24 UTC
OK. This drops eigenvalue of -2e-9, but keeps 4e-9:

> cmdscale(as.matrix(eurodist)[1:12,1:12], eig=T, k=11)$eig
[1] 1.487721e+07 6.192559e+06 1.122789e+06 6.054082e+05 2.537557e+05
[6] 1.061486e+05 2.695193e+04 3.725290e-09

Probably in 50% of cases it does a nice job in cleaning about-zero eigenvalues
(i.e, when the about-zero ev  also is negative) and printing the result nicely
without switching to "scientific" notation. 

Now it doesn't make much sense to have two GOF statistics: they are equal since
there is first 

evalus <- e$values[e$values > 0]

then

GOF = sum(ev)/c(sum(abs(evalus)), sum(evalus[evalus > 0]))

Is this a "wishlist" item?

PS. My previous comment on 53321 was dated at 2010-10-15 08:04:02 EDT (in this server), and the "current" r53322 was r53323 | ripley | 2010-10-15 15:39:37 +0300 (Fri, 15 Oct 2010), i.e, 3 minutes later. So this was a race condition (and it really sounds like one).
Comment 4 Jari Oksanen 2010-10-15 14:10:19 UTC
I'll shut up soon, but one comment about GOF. I don't expect any changes here, nor that I'm taken as anything else than "confused". However, here is an R representation of Gower's explanation of negative axes.

Assume that we have a variant of cmdscale function that also returns negative eigenvalues (but drops the zero eigenvalue), and also returns scores ("points") for those mind boggling axes with negative eigenvalue, and these scaled by sqrt of the absolute value of that negative eigenvalue. To make imagination easier, there is such a program (wdcmdscale in vegan: written for quite a different purpose, but updated with these properties to educate its author). With this function we get:

> m <- wcmdscale(eurodist, eig=TRUE, k=20) #defaults to all ev's, so 'k' not needed
> posax <- m$points
> negax <- m$negaxes
> sum(eurodist - dist(posax))
[1] -41994.34
> sum(eurodist - sqrt(dist(posax)^2 - dist(negax)^2) )
[1] 1.909655e-10

This means that all axes with positive eigenvalues still gives an incomplete fit to the original non-Euclidean distances (but gives GOF=1 currently, unless changed while writing this). If we subtract the distances related to the negative eigenvalues we get a complete fit. This would seem to hint that in this sense (Gower's sense) an alternative measure of GOF would involve the sum of all eigenvalues, positive and negative. So there could be one GOF measure (the current one: sum of positive eigenvalues), or if there are two, the second could be bases on sum of all eigenvalues, positive and negative.

Well, I think that a dinner in an Italian restaurant with my wife does me better than this discussion with repeated "race conditions"
Comment 5 Brian Ripley 2010-10-15 17:15:41 UTC
But 53322 is current.

On Fri, 15 Oct 2010, r-bugs@r-project.org wrote:

> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14397
>
> --- Comment #2 from Jari Oksanen <jari.oksanen@oulu.fi> 2010-10-15 08:04:02 EDT ---
> With revision 53321 for every k too high (in this case k > 11):
>
>> m1 <- cmdscale(eurodist, eig=T, k=12)
> Error in evec %*% diag(sqrt(ev), k) : non-conformable arguments
> In addition: Warning messages:
> 1: In cmdscale(eurodist, eig = T, k = 12) :
>  some of the first 12 eigenvalues are < 0
> 2: In y[1L + 0L:(m - 1L) * (n + 1L)] <- x :
>  number of items to replace is not a multiple of replacement length
>
> -- 
> Configure bugmail: https://bugs.r-project.org/bugzilla3/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are the assignee for the bug.
>
> _______________________________________________
> R-core list: https://stat.ethz.ch/mailman/listinfo/r-core
>


-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Comment 6 Jari Oksanen 2010-10-17 06:52:00 UTC
These comments concern r53328 which is the current version of cmdscale.R *when* I write this.
The issues seem to be (mainly) fixed now. The sum of abs eigenvalues is correct since r53728

> sapply(mods, function(x) sum(abs(x$eig)))
 release   r53322   r53323   r53328 
39399569 36172885 36172885 41651413 

The GOF statistics were correct already from r53323:

> sapply(mods, function(x) x$GOF)
       release r53322    r53323    r53328
[1,] 0.8362071      1 0.8684672 0.8684672
[2,] 0.9107983      1 1.0000000 1.0000000

The only thing that is against documentation is the number of eigenvalues returned. File cmdsdale.Rd in r53728 reads:

  \item{eig}{the \eqn{n} eigenvalues computed during the scaling process if
    \code{eig} is true.  \strong{NB}: versions of \R before 2.12.1
    returned only \code{k} but were documented to return \eqn{n-1}.}

But the current version returns 'n' eigenvalues instead of the documented 'n-1':

> sapply(mods, function(x) length(x$eig))
release  r53322  r53323  r53328 
     20      11      11      21 

The eurodist data has 21 observations

> attr(eurodist, "Size")
[1] 21

The current version does not remove the zero eigenvalue (like defined in the literature From Gower to Mardia to Cox & Cox), but returns also the zero. Incidentally this was the subject of this bug report, but as documente in News.Rd, this is confused:

     \item \code{cmdscale(add = FALSE)} now uses the more common
      definition that there is a representation in \code{n-1} or less
      dimensions, and only dimensions corresponding to positive
      eigenvalues are used.  (Avoids confusion such as \PR{14397}.)

(this still has the confusion between \code{n-1} used and \code{n} returned)