From: Peter Kleiweg <firstname.lastname@example.org>
There seems to be a bug in hclust (package mva) for clustering
methods 'median' and 'centroid'.
I have written a clustering program in C and discovered that the
results for 'median' differ from those of hclust in R. I used a
third program, written by someone else in Pascal, and that
program agrees with the output of my program.
I found yet another clustering program that seems to be built on
the same fortran code as was used for hclust. The source of this
code mentions a bug in the original code that effects both
methods 'median' and 'centroid'. This program has a fix for this
bug, but I can find no similar fix in the code of R's hclust.
You can find the program with the fix at:
The relevant file is: qclust.c
The bug is mentioned at line 670 of that code.
The fix for the bug starts at line 908.
Unfortunatly, I do not know Fortran programming, so I can not
offer a tested solution for hclust. I hope I have located the
problem accurately enough for others to deal with it further.
You can find a data set to test this bug at:
If you source this file, and then run:
... you will see a list with the last value:
The correct value should be:
--please do not edit the information below--
platform = i686-pc-linux-gnu
arch = i686
os = linux-gnu
system = i686, linux-gnu
major = 1
minor = 7.1
year = 2003
month = 06
day = 16
language = R
.GlobalEnv, package:methods, package:ctest, package:mva, package:modreg, package:nls, package:ts, Autoloads, package:base
From: Prof Brian Ripley <email@example.com>
I've taken a look at this. What the R code does is to recalculate the
nearest neighbours & distances after updating the distances, for all
clusters other than the new one which it attempted to do on the fly.
The problem is that merging two clusters can make distances to the cluster
go up and so what was a nearest neighbour may stop being so, as well as
the reverse. So I am not convinced that the correction is in fact enough
(what if i2 had previously been the nearest neighbour of k?) although this
may not affect the later steps.
It is as fast just to update all nearest neighbours, and I have changed
the R code to do so. It is a lot easier to convince oneself that the code
gives the correct answer! I now get an answer much closer to yours and
hope the difference is due to rounding errors in dumping the dataset. (It
is also what I got by incorporating the fix in the C code you pointed us
Brian D. Ripley, firstname.lastname@example.org
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
probably fixed in 1.8.1
Audit (from Jitterbug):
Sun Nov 9 14:39:32 2003 ripley changed notes
Sun Nov 9 14:39:32 2003 ripley foobar
Sun Nov 9 14:39:32 2003 ripley moved from incoming to Analyses-fixed
We need a different fix:
Before the fix (in 2003), hclust() was very fast for large n,
but it no longer was after the fix.
Instead I'm going back to the original report's qclust.c
(which I have kept) and will commit another hclust.f
1) fix the bug
2) still be a fast hclust()