Bug 4195 - hclust: median, centroid
hclust: median, centroid
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Analyses
old
All Linux
: P5 normal
Assigned To: Jitterbug compatibility account
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2003-09-17 01:22 UTC by Jitterbug compatibility account
Modified: 2012-04-15 00:48 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jitterbug compatibility account 2003-09-17 01:22:53 UTC
From: Peter Kleiweg <kleiweg@let.rug.nl>

There seems to be a bug in hclust (package mva) for clustering
methods 'median' and 'centroid'.

I have written a clustering program in C and discovered that the
results for 'median' differ from those of hclust in R. I used a
third program, written by someone else in Pascal, and that
program agrees with the output of my program.

I found yet another clustering program that seems to be built on
the same fortran code as was used for hclust. The source of this
code mentions a bug in the original code that effects both
methods 'median' and 'centroid'. This program has a fix for this
bug, but I can find no similar fix in the code of R's hclust.

You can find the program with the fix at:
http://www2.biology.ualberta.ca/jbrzusto/ftp/trees/source.zip
The relevant file is: qclust.c
The bug is mentioned at line 670 of that code.
The fix for the bug starts at line 908.

Unfortunatly, I do not know Fortran programming, so I can not
offer a tested solution for hclust. I hope I have located the
problem accurately enough for others to deal with it further.

You can find a data set to test this bug at:
http://www.let.rug.nl/~kleiweg/R/data
If you source this file, and then run:

    sort(hclust(d, method="median")$height)

... you will see a list with the last value:

    0.08449670

The correct value should be:

    0.081786


--please do not edit the information below--

Version:
 platform = i686-pc-linux-gnu
 arch = i686
 os = linux-gnu
 system = i686, linux-gnu
 status =
 major = 1
 minor = 7.1
 year = 2003
 month = 06
 day = 16
 language = R

Search Path:
 .GlobalEnv, package:methods, package:ctest, package:mva, package:modreg, package:nls, package:ts, Autoloads, package:base

Comment 1 Jitterbug compatibility account 2003-11-09 19:25:13 UTC
From: Prof Brian Ripley <ripley@stats.ox.ac.uk>
I've taken a look at this. What the R code does is to recalculate the
nearest neighbours & distances after updating the distances, for all
clusters other than the new one which it attempted to do on the fly.  
The problem is that merging two clusters can make distances to the cluster
go up and so what was a nearest neighbour may stop being so, as well as
the reverse.  So I am not convinced that the correction is in fact enough
(what if i2 had previously been the nearest neighbour of k?) although this 
may not affect the later steps.

It is as fast just to update all nearest neighbours, and I have changed
the R code to do so. It is a lot easier to convince oneself that the code
gives the correct answer! I now get an answer much closer to yours and
hope the difference is due to rounding errors in dumping the dataset. (It
is also what I got by incorporating the fix in the C code you pointed us
to.)

BDR

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Comment 2 Jitterbug compatibility account 2003-11-09 19:33:00 UTC
NOTES:
 probably fixed in 1.8.1
Comment 3 Jitterbug compatibility account 2003-11-09 20:39:32 UTC
Audit (from Jitterbug):
Sun Nov  9 14:39:32 2003	ripley	changed notes
Sun Nov  9 14:39:32 2003	ripley	foobar
Sun Nov  9 14:39:32 2003	ripley	moved from incoming to Analyses-fixed
Comment 4 Martin Maechler 2012-04-15 00:48:01 UTC
We need a different fix:
Before the fix (in 2003), hclust() was very fast for large n,
but it no longer was after the fix.

Instead I'm going back to the original report's  qclust.c
(which I have kept) and will commit another  hclust.f 
which will 
1) fix the bug
2) still be a fast hclust()