Bug 15364 - stats::kmeans() stalls
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Analyses
Version: R 3.0.0
Hardware: x86_64/x64/amd64 (64-bit) Linux
Importance: P3 major
Assigned To: Martin Maechler
Reported: 2013-06-26 23:45 UTC by gibberwocky
Modified: 2015-03-28 00:25 UTC

Attachments
data stored in pca.pred[,1] (26.37 KB, application/octet-stream)
2013-06-26 23:45 UTC, gibberwocky

Description gibberwocky 2013-06-26 23:45:29 UTC
Created attachment 1459
data stored in pca.pred[,1]

The kmeans() function of the stats package stalls/hangs under the following circumstances:

tmp <- kmeans(pca.pred[,1], centers=2, nstart=10)

I have attached a copy of the data stored in pca.pred[,1]. It appears that the lack of variance in the samples might be causing the calculation to stall for whatever reason.
Comment 1 gibberwocky 2013-06-27 12:39:05 UTC
The same data and function appear to work OK under R 2.15...
Comment 2 gibberwocky 2013-06-27 13:24:51 UTC
I've just installed R v2.15.3 on a different machine (same OS, Linux 12.04 LTS), and it stalls. So the situation seems more complicated than first thought.
Comment 3 gibberwocky 2013-06-27 14:20:58 UTC
The machine on which this works is actually running R v2.15.2 not 2.15.3 as previously thought. I've since tried R v2.15.2 on the other machine and it still stalls.

This all seems reminiscent of an issue that was previously reported:

https://stat.ethz.ch/pipermail/r-help//2013-February/347175.html
Comment 4 Martin Maechler 2013-06-27 16:38:16 UTC
Thank you; your example is (strictly speaking not completely) reproducible
for me on my Linux desktop....
so it should be easy to fix.

I'm having a go at it.
Martin
Comment 5 Martin Maechler 2013-07-01 13:42:19 UTC
(In reply to comment #4)
> Thank you; your example is (strictly speaking not completely) reproducible
> for me on my Linux desktop....
> so it should be easy to fix.
> 
> I'm having a go at it.

It's a rounding / precision problem that I can reproduce .. only on 64-bit BTW.
I've decided to *not* add numerical fuzz .. which would potentially also change cases that have been working previously.
Rather, we now catch the case of too many 'steps' in the QTRAN routine.
Committed to R-devel and "R 3.0.1 patched".

Martin
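
For readers unfamiliar with the idea, here is a minimal sketch of a step budget in general. It is not the Fortran code from kmeans(); `run_with_step_budget` and `flip` are hypothetical names made up for illustration. The loop stops as soon as a step reports no change, and otherwise gives up after `max_steps` calls instead of running forever.

```r
# Hypothetical helper: call `step` until it reports no change,
# but stop after `max_steps` calls instead of looping indefinitely.
run_with_step_budget <- function(step, state, max_steps) {
  for (i in seq_len(max_steps)) {
    res <- step(state)
    if (!res$changed) {
      return(list(state = res$state, steps = i, converged = TRUE))
    }
    state <- res$state
  }
  # Budget exhausted: report non-convergence rather than hanging.
  list(state = state, steps = max_steps, converged = FALSE)
}

# A step that never settles: it keeps flipping the sign of its input.
flip <- function(s) list(state = -s, changed = TRUE)

out <- run_with_step_budget(flip, 1, max_steps = 200)
out$converged
```

A caller can then turn `converged = FALSE` into a warning or an error, which is the kind of behaviour one would expect instead of a stall.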
Comment 6 Gökcen Eraslan 2015-03-27 16:41:25 UTC
It seems the commit that fixes this bug report leads to another bug:

> kmeans(runif(2^31/50 + 1), 2)
Error in do_one(nmeth) : 
  (converted from warning) NAs introduced by coercion

However:

> kmeans(runif(2^31/50), 2)

works perfectly. The reason seems to be the overflow in the last line of the following code which was introduced by the commit fixing this bug[1]:

isteps.Qtran <- 50 * m
iTran <- c(as.integer(isteps.Qtran), integer(max(0, k - 1)))

Since m is the number of rows of the data on which the clustering is performed, 50 * m exceeds the 32-bit signed integer limit of 2^31 - 1, which is 2147483647, as soon as m is larger than 2^31/50.
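
The arithmetic can be checked in R without allocating the data. This is only a sketch: the names `m_ok`, `m_bad`, `budget_ok` and `budget_bad` are mine and stand for the two vector lengths used above and the resulting step counts.

```r
# Lengths of the two test vectors (a fractional length is truncated).
m_ok  <- floor(2^31 / 50)      # 42949672
m_bad <- floor(2^31 / 50 + 1)  # 42949673

# 50 * m is computed in double precision, so the product itself is fine ...
50 * m_ok  <= .Machine$integer.max   # TRUE
50 * m_bad <= .Machine$integer.max   # FALSE

# ... but as.integer() cannot represent the larger value and yields NA,
# together with a warning. With options(warn = 2) that warning becomes
# the "(converted from warning)" error shown above.
budget_ok  <- as.integer(50 * m_ok)
budget_bad <- suppressWarnings(as.integer(50 * m_bad))
is.na(budget_bad)                    # TRUE
```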

Here are some users who encountered the bug in question:

http://stackoverflow.com/questions/22396234/r-kmeans-nas-in-foreign-function-call-arg-13-error

http://stackoverflow.com/questions/26952232/kmeans-on-46-million-elements-coerces-na-values/

Although this problem only exists in the default method of kmeans (Hartigan-Wong) in R > 3.0.1, it would be much better to support datasets with more than 2^31/50 rows again, as R 2.x did.

[1] https://github.com/wch/r-source/commit/59322766f85722159a68aa92b1aceaa3f7c6f66e#diff-3e40799bd040612e2898ed3f76483a18R27
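
Since the overflow is specific to the Hartigan-Wong method, affected users could try one of the other algorithms that kmeans() offers until a fixed R version is available. A minimal sketch on toy data follows; the data and the choice of "Lloyd" are illustrative assumptions, and a different algorithm may give a different clustering than Hartigan-Wong would.

```r
set.seed(42)  # toy data for illustration only, not the reporter's data
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 10))

# `algorithm` is a documented argument of stats::kmeans();
# "Lloyd" and "MacQueen" are alternatives to the default "Hartigan-Wong".
fit <- kmeans(x, centers = 2, algorithm = "Lloyd", iter.max = 50)
fit$size
```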
Comment 7 Martin Maechler 2015-03-27 22:23:24 UTC
(In reply to Gökcen Eraslan from comment #6)
> It seems the commit that fixes this bug report leads to another bug:
> 
> > kmeans(runif(2^31/50 + 1), 2)
> Error in do_one(nmeth) : 
>   (converted from warning) NAs introduced by coercion
> 
> However:
> 
> > kmeans(runif(2^31/50), 2)
> 
> works perfectly. The reason seems to be the overflow in the last line of the
> following code which was introduced by the commit fixing this bug[1]:
> 
> isteps.Qtran <- 50 * m
> iTran <- c(as.integer(isteps.Qtran), integer(max(0, k - 1)))
> 
> Since m is the number of rows of the data on which the clustering is
> performed, 50 * m exceeds the 32-bit signed integer limit of 2^31 - 1, which is
> 2147483647.
> 
> Here are some users who encountered the bug in question:
> 
> http://stackoverflow.com/questions/22396234/r-kmeans-nas-in-foreign-function-call-arg-13-error
> 
> http://stackoverflow.com/questions/26952232/kmeans-on-46-million-elements-coerces-na-values/

I'm startled that no knowledgeable R user seems to have seen these questions till now, as indeed it is very easy to diagnose the problem as you did.
> 
> Although, this problem only exists in the default method of kmeans
> (Hartigan-Wong) in R > 3.0.1, it would be much better to again support
> datasets with nrows > 2^31/50, as it was in R 2.x.

Well yes, "much better" being a bit strong. The limit is at 2^31 anyway.
But indeed, the change is simple and will be in R 3.2.0 and later.

Thank you, Gökcen!
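
The committed change itself is not shown in this thread. One simple way to avoid the NA, sketched here as an assumption and not as the actual patch, is to cap the step budget at the largest representable integer before converting it. `safe_step_budget` is a hypothetical helper name.

```r
# Hypothetical helper, not taken from the R sources: compute the
# step budget 50 * m, but never let it exceed the largest R integer.
safe_step_budget <- function(m) {
  as.integer(min(50 * m, .Machine$integer.max))
}

safe_step_budget(1000)       # 50000
safe_step_budget(42949673)   # capped at .Machine$integer.max
```

With such a cap the budget is slightly smaller than 50 * m for very large inputs, but the conversion no longer produces NA.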
Comment 8 Martin Maechler 2015-03-27 22:53:48 UTC
(In reply to Martin Maechler from comment #7)
> (In reply to Gökcen Eraslan from comment #6)

Bug fix committed to R 3.2.0 alpha and R-devel.

A "real" and "future proof" solution would rewrite the underlying Fortran code in C and use R's "long vector"s when needed.
Comment 9 Gökcen Eraslan 2015-03-28 00:25:35 UTC
(In reply to Martin Maechler from comment #8)
> (In reply to Martin Maechler from comment #7)
> > (In reply to Gökcen Eraslan from comment #6)
> 
> Bug fix committed to R 3.2.0 alpha and R-devel.
> 
> A "real" and "future proof" solution would rewrite the underlying Fortran
> code in C and use R's "long vector"s when needed.

Thanks a lot for the quick fix. I would also like to give the link to the commit on the unofficial GitHub mirror for those who are interested:

https://github.com/wch/r-source/commit/969802cc242d69d2b99472d9c3a3201362f45429