Created attachment 1130 [details] diff file of fixed cmdscale: remove zero eigenvalue instead of the last one One of the eigenvalues will be zero in metric scaling performed by cmdscale(). Function cmdscale() assumes that the last eigenvalue is zero and always drops that one. However, with non-Euclidean distances/dissimilarities, some of the eigenvalues are negative and then the zero is not the last one, but the last eigenvalue will be the largest negative one. Here is an example with the 'eurodist' data (used in example(cmdscale)): # original code: last eigenvalue removed > cmdscale(eurodist, k=20, eig=TRUE)$eig [1] 1.953838e+07 1.185656e+07 1.528844e+06 1.118742e+06 7.893472e+05 [6] 5.816552e+05 2.623192e+05 1.925976e+05 1.450845e+05 1.079673e+05 [11] 5.139484e+04 -1.862645e-09 -9.496124e+03 -5.305820e+04 -1.322166e+05 [16] -2.573360e+05 -3.326719e+05 -5.162523e+05 -9.191491e+05 -1.006504e+06 Here the zero eigenvalue is number 12 of magnitude -2e-9. The eigenvalues on both sides are 5e4 and -9e3 which are orders of magnitude higher. The last reported eigenvalue is -1e6. If we remove the zero eigenvalue (numbere 12), the code will return: # fixed code: zero eigenvalue removed > cmdscale(eurodist, k=20, eig=TRUE)$eig [1] 19538377.090 11856555.334 1528844.468 1118741.951 789347.203 [6] 581655.207 262319.208 192597.562 145084.535 107967.307 [11] 51394.841 -9496.124 -53058.196 -132216.575 -257336.026 [16] -332671.901 -516252.254 -919149.098 -1006503.960 -2251844.332 The removal of zero eigenvalue and keeping the last eigenvalue will influence the goodness-of-fit statistic (GOF item in result). There are other problems with the GOF statistic, but I sine I already had an item on those in R-devel, I don't want to return to that issue. My report https://stat.ethz.ch/pipermail/r-devel/2009-June/053729.html I attach a diff that seems to fix the problem.

Correct definition (according to Cox & Cox, Marriott, Mardia ....) is now used (2.12.0 patched) when there are negative eigenvalues.

With revision 53321 for every k too high (in this case k > 11): > m1 <- cmdscale(eurodist, eig=T, k=12) Error in evec %*% diag(sqrt(ev), k) : non-conformable arguments In addition: Warning messages: 1: In cmdscale(eurodist, eig = T, k = 12) : some of the first 12 eigenvalues are < 0 2: In y[1L + 0L:(m - 1L) * (n + 1L)] <- x : number of items to replace is not a multiple of replacement length

OK. This drops eigenvalue of -2e-9, but keeps 4e-9: > cmdscale(as.matrix(eurodist)[1:12,1:12], eig=T, k=11)$eig [1] 1.487721e+07 6.192559e+06 1.122789e+06 6.054082e+05 2.537557e+05 [6] 1.061486e+05 2.695193e+04 3.725290e-09 Probably in 50% of cases it does a nice job in cleaning about-zero eigenvalues (i.e, when the about-zero ev also is negative) and printing the result nicely without switching to "scientific" notation. Now it doesn't make much sense to have two GOF statistics: they are equal since there is first evalus <- e$values[e$values > 0] then GOF = sum(ev)/c(sum(abs(evalus)), sum(evalus[evalus > 0])) Is this a "wishlist" item? PS. My previous comment on 53321 was dated at 2010-10-15 08:04:02 EDT (in this server), and the "current" r53322 was r53323 | ripley | 2010-10-15 15:39:37 +0300 (Fri, 15 Oct 2010), i.e, 3 minutes later. So this was a race condition (and it really sounds like one).

I'll shut up soon, but one comment about GOF. I don't expect any changes here, nor that I'm taken as anything else than "confused". However, here is an R representation of Gower's explanation of negative axes. Assume that we have a variant of cmdscale function that also returns negative eigenvalues (but drops the zero eigenvalue), and also returns scores ("points") for those mind boggling axes with negative eigenvalue, and these scaled by sqrt of the absolute value of that negative eigenvalue. To make imagination easier, there is such a program (wdcmdscale in vegan: written for quite a different purpose, but updated with these properties to educate its author). With this function we get: > m <- wcmdscale(eurodist, eig=TRUE, k=20) #defaults to all ev's, so 'k' not needed > posax <- m$points > negax <- m$negaxes > sum(eurodist - dist(posax)) [1] -41994.34 > sum(eurodist - sqrt(dist(posax)^2 - dist(negax)^2) ) [1] 1.909655e-10 This means that all axes with positive eigenvalues still gives an incomplete fit to the original non-Euclidean distances (but gives GOF=1 currently, unless changed while writing this). If we subtract the distances related to the negative eigenvalues we get a complete fit. This would seem to hint that in this sense (Gower's sense) an alternative measure of GOF would involve the sum of all eigenvalues, positive and negative. So there could be one GOF measure (the current one: sum of positive eigenvalues), or if there are two, the second could be bases on sum of all eigenvalues, positive and negative. Well, I think that a dinner in an Italian restaurant with my wife does me better than this discussion with repeated "race conditions"

But 53322 is current. On Fri, 15 Oct 2010, r-bugs@r-project.org wrote: > https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14397 > > --- Comment #2 from Jari Oksanen <jari.oksanen@oulu.fi> 2010-10-15 08:04:02 EDT --- > With revision 53321 for every k too high (in this case k > 11): > >> m1 <- cmdscale(eurodist, eig=T, k=12) > Error in evec %*% diag(sqrt(ev), k) : non-conformable arguments > In addition: Warning messages: > 1: In cmdscale(eurodist, eig = T, k = 12) : > some of the first 12 eigenvalues are < 0 > 2: In y[1L + 0L:(m - 1L) * (n + 1L)] <- x : > number of items to replace is not a multiple of replacement length > > -- > Configure bugmail: https://bugs.r-project.org/bugzilla3/userprefs.cgi?tab=email > ------- You are receiving this mail because: ------- > You are the assignee for the bug. > > _______________________________________________ > R-core list: https://stat.ethz.ch/mailman/listinfo/r-core > -- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595

These comments concern r53328 which is the current version of cmdscale.R *when* I write this. The issues seem to be (mainly) fixed now. The sum of abs eigenvalues is correct since r53728 > sapply(mods, function(x) sum(abs(x$eig))) release r53322 r53323 r53328 39399569 36172885 36172885 41651413 The GOF statistics were correct already from r53323: > sapply(mods, function(x) x$GOF) release r53322 r53323 r53328 [1,] 0.8362071 1 0.8684672 0.8684672 [2,] 0.9107983 1 1.0000000 1.0000000 The only thing that is against documentation is the number of eigenvalues returned. File cmdsdale.Rd in r53728 reads: \item{eig}{the \eqn{n} eigenvalues computed during the scaling process if \code{eig} is true. \strong{NB}: versions of \R before 2.12.1 returned only \code{k} but were documented to return \eqn{n-1}.} But the current version returns 'n' eigenvalues instead of the documented 'n-1': > sapply(mods, function(x) length(x$eig)) release r53322 r53323 r53328 20 11 11 21 The eurodist data has 21 observations > attr(eurodist, "Size") [1] 21 The current version does not remove the zero eigenvalue (like defined in the literature From Gower to Mardia to Cox & Cox), but returns also the zero. Incidentally this was the subject of this bug report, but as documente in News.Rd, this is confused: \item \code{cmdscale(add = FALSE)} now uses the more common definition that there is a representation in \code{n-1} or less dimensions, and only dimensions corresponding to positive eigenvalues are used. (Avoids confusion such as \PR{14397}.) (this still has the confusion between \code{n-1} used and \code{n} returned)