Bug 17274 - floating point bug in nclass.FD can cause hist() to crash
Summary: floating point bug in nclass.FD can cause hist() to crash
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Accuracy
Version: R 3.3.*
Hardware: x86_64/x64/amd64 (64-bit) Linux
Importance: P5 minor
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2017-05-19 15:08 UTC by Sietse Brouwer
Modified: 2017-08-09 16:27 UTC
CC List: 2 users

See Also:


Description Sietse Brouwer 2017-05-19 15:08:18 UTC
Summary
-------

Floating point errors can cause a data vector to have an ultra-small
inter-quartile range, which causes `grDevices::nclass.FD` to suggest
an absurdly large number of breaks to `graphics::hist(breaks="FD")`.
Because this large float becomes NA when converted to integer, hist's
call to `base::pretty` crashes.

How could nclass.FD, which has the job of suggesting a reasonable number of
classes, avoid suggesting an absurdly large number of classes when the
inter-quartile range is absurdly small compared to the range?


Steps to reproduce
------------------

    hist(c(1, 1, 1, 1 + 1e-15, 2), breaks="FD")


Observed behaviour
------------------

Running this code gives the following error message:

    Error in pretty.default(range(x), n = breaks, min.n = 1):
      invalid 'n' argument
    In addition: Warning message:
    In pretty.default(range(x), n = breaks, min.n = 1) :
      NAs introduced by coercion to integer range


Expected behaviour
------------------

That hist() should never crash when given valid numerical data. Specifically,
that it should be robust even to those rare datasets where (through floating
point inaccuracy) the inter-quartile range is tens of orders of magnitude
smaller than the range.


Analysis
--------

Dramatis personae:

* graphics::hist.default
  https://svn.r-project.org/R/trunk/src/library/graphics/R/hist.R

* grDevices::nclass.FD
  https://svn.r-project.org/R/trunk/src/library/grDevices/R/calc.R

* base::pretty.default
  https://svn.r-project.org/R/trunk/src/library/base/R/pretty.R

`nclass.FD` examines the inter-quartile range of `x`, and gets a positive, but
very small floating point value -- let's call it TINYFLOAT. It inserts this
ultra-low IQR into the `nclass` denominator, which means `nclass`
becomes a huge number -- let's call it BIGFLOAT. `nclass.FD` then returns this
huge value to `hist`.

Once `hist` has its 'number of breaks' suggestion, it feeds this
number to `pretty`:

    pretty(range(x), BIGFLOAT, min.n = 1)

`pretty`, in turn, calls

    .Internal(pretty(min(x), max(x), BIGFLOAT, min.n, shrink.sml,
        c(high.u.bias, u5.bias), eps.correct))

This fails with the error and warning shown at the start of this report
(invalid 'n' argument / NAs introduced by coercion to integer range). My
reading is that .Internal tried to coerce BIGFLOAT to the integer range and
produced an NA, and that (the C implementation of) `pretty`, in turn, choked
when confronted with that NA.
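
The chain can be traced interactively (a minimal sketch; the exact values
below are from my machine and depend on the platform's double rounding):

    x <- c(1, 1, 1, 1 + 1e-15, 2)
    stats::IQR(x)              ## TINYFLOAT: about 1.1e-15
    grDevices::nclass.FD(x)    ## BIGFLOAT: about 7.7e14, far above .Machine$integer.max
    as.integer(7.7e14)         ## NA, with "NAs introduced by coercion to integer range"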
Comment 1 Duncan Murdoch 2017-05-19 21:13:46 UTC
I can confirm the issue.  For future reference, this isn't a "crash", just a failure with an error message, but I agree it shouldn't happen.
Comment 2 Martin Maechler 2017-05-20 13:06:04 UTC
(In reply to Duncan Murdoch from comment #1)
> I can confirm the issue.  For future reference, this isn't a "crash", just a
> failure with an error message, but I agree it shouldn't happen.

I agree. However, arguably this is not just a problem of "Accuracy".

If you look at this case (and its obvious generalization into the one-dimensional manifold of problems obtained by replacing 10^-15 with a parameter epsilon) and at the nclass.FD() formula(s), you may start to see that it is rather a problem of how to compute robust scale estimates in such cases, notably if you need a robust scale estimate which must be positive.  Interestingly, FD (= Freedman-Diaconis) do consider scale 0, but then use the MAD() instead of the IQR() ... but that latter "design" seems a bit backwards here: the MAD() is more robust than the IQR() and hence is even more often 0 than the IQR in such situations (of "inliers", the counterpart of "outliers").

I think we should fix along the following ideas:

The 'h' used in  nclass.FD()  should "underflow to 0" in this case, i.e., a reasonable _robust_ scale estimate should be 0, at least in the case of
epsilon=1e-15, and so nclass.FD() would return 1.

There's more to this:  FD (Freedman-Diaconis) really only mentioned the IQR,
not the extra clause using the MAD:
https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule
clearly states the FD rule  as  choosing   h =  2 IQR(x) / length(x)^{1/3}

(you can get to the PDF of the original Springer paper).
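
Spelled out (my restatement, just rearranging that formula): with bin width
h = 2*IQR(x)/n^(1/3), the suggested class count is ceiling(diff(range(x))/h),
which is what nclass.FD() computes:

    x <- rnorm(100)
    h <- 2 * stats::IQR(x) / length(x)^(1/3)   ## FD bin width
    ceiling(diff(range(x)) / h)                ## suggested number of classes
    grDevices::nclass.FD(x)                    ## should agree (up to rounding tweaks)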

The extra step that happens in nclass.FD(), choosing MAD instead of IQR if the latter is zero, seems to be a later invention ...
and a pretty 'strange' one: I cannot easily find an 'x' for which
IQR(x) is zero but MAD(x) is not; i.e., the extra if(.) does not seem to help at all.  (Can anyone else find a case for which the quartiles are equal, but the MAD is not zero?  It can only happen in numeric/boundary cases: in such a case, the middle half of the sample are all identical observations, identical to the median, so half of the "deviations from the median" are already zero.)
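
(A quick illustration of that argument, with a small example of my own: when
the middle half of the sample equals the median, the MAD underflows right
along with the IQR.)

    x <- c(1, 1, 1, 1, 2)          ## middle half identical to the median
    stats::IQR(x)                  ## 0
    stats::mad(x, constant = 2)    ## 0 as well: most |x - median(x)| are already 0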

Interestingly enough, the guy that introduced the mad(.) clause 
  r40860 | maechler | 2007-03-21 11:56:55 |
must have had different thoughts back then  (;-)

After all these musings, I still think we should treat this similarly to how smooth.spline() dealt with 'x' values that were very close, but not quite equal: use 'signif(x)' instead of 'x' in the crucial case.
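
For the example above, the 'h' does then underflow as intended (my check;
signif()'s default is 6 significant digits):

    x <- c(1, 1, 1, 1 + 1e-15, 2)
    stats::IQR(signif(x))   ## 0 -- rounding removes the spurious quartile gap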

Since this also changes the result --- **very** rarely --- for "regular" cases,
I'd propose to also replace the previous    
     if(h == 0) h <- mad(x, constant = 2)
by something more reasonable.

Martin
Comment 3 Sietse Brouwer 2017-05-20 16:14:03 UTC
(In reply to Duncan Murdoch from comment #1)
> I can confirm the issue.  For future reference, this isn't a "crash", just a
> failure with an error message, but I agree it shouldn't happen.

Good point, noted.

(In reply to Martin Maechler from comment #2)

> I think we should fix along the following ideas:
>
> The 'h' used in  nclass.FD()  should "underflow to 0" in this case, i.e., a
> reasonable _robust_ scale estimate should be 0, at least in the case of
> epsilon=1e-15, and so nclass.FD() would return 1.
[...]
> After all these musings, I still think we should treat this similar to how
> the smooth.spline() had dealt with  'x' values that were very close, but not
> quite equal:  Use  'signif(x)'  instead of 'x' in the crucial case.

One option might be `robust_h <- zapsmall(c(h, diff(range(x))))[1]`. By zapsmall's behaviour, `robust_h` will be 0 if `h` is at least `getOption('digits')` orders of magnitude smaller than `diff(range(x))`. By default, `digits` is 7, so that's 7 orders of magnitude. This means the smallest IQR that doesn't get zapped would still lead to quite a few histogram breaks:

    data <- c(1, 1, 1, 1 + 1e-6, 2)
    nclass.FD(data) == 854988
    hist(data, breaks = "FD")  ## does not crash

Although 800k is still far more breaks than makes sense on a computer screen, it is well below MAXINT, and so the problem is averted. As for the question 'does this solution catch all errors?', I don't know enough about floating point numbers. Can one safely assume that floating point imprecision usually does not get within 6 orders of magnitude of the data? (Well, zapsmall implicitly assumes that, and it seems to work.)

> Since this also changes the result --- **very** rarely --- for "regular" cases,
> I'd propose to also replace the previous    
>     if(h == 0) h <- mad(x, constant = 2)
> by something more reasonable.

Since Freedman-Diaconis is undefined for IQRs of (nearly) 0, would it make sense to simply fall back to nclass.Sturges in those cases? I agree that the median absolute deviation isn't much help. The wiki page on FD says Sturges works well for data where n < 200; I expect pathological IQRs, too, will be more common in very small datasets.

Combining these thoughts (using zapsmall to underflow h to 0, dropping the MAD, and falling back to Sturges if h is 0) results in something like this, modulo nitpicks and coding style. (In particular, h and safe_h could be combined into a one-liner, but for now writing them out is clearer.) Thoughts?

    nclass.FD <- function (x) {
        h <- stats::IQR(x)
        safe_h <- zapsmall(c(h, diff(range(x))))[1]
        if (safe_h == 0)
            nclass.Sturges(x)
        else
            ceiling(diff(range(x))/(2 * h * length(x)^(-1/3)))
    }
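
Spot-checking this sketch on the data from this report (my run, with the
function exactly as written above):

    nclass.FD(c(1, 1, 1, 1 + 1e-15, 2))   ## 4, via the Sturges fallback
    nclass.FD(c(1, 1, 1, 1 + 1e-6, 2))    ## 854988: h survives zapsmall, FD formula used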
Comment 4 Martin Maechler 2017-07-25 18:06:52 UTC
I'm following your idea of reverting to another method when reasonable ways of trying to improve nclass.FD() "fail" - in the sense of still ending up with 'h = 0'.
But I'd strongly prefer  Scott to Sturges in that case.

Also, you had proposed to use zapsmall() on the diff(); I take that idea,
but apply signif(x, digits = 5) instead [5 seems better than 7, e.g.], at least for one set of examples.

Also, I'm trying quite a bit harder to get a "robust" scale,
by going from quartiles to octiles, "sixteen-tiles", etc., until I give up.

This currently gives

nclass.FD <- function(x)
{
    h <- 2 * stats::IQR(x. <- signif(x, digits = 5))
    if (h == 0) {
	x. <- sort(x.)
	al <- 1/4; al.min <- 1/512 # try quantiles 1/8, 1/16, ... 1/512
	while(h == 0 && (al <- al/2) >= al.min)
	    h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / al
    }
    if (h == 0) ## revert to Scott's:
	h <- 3.5 * sqrt(stats::var(x))
    if (h > 0) ceiling(diff(range(x))/h * length(x)^(1/3)) else 1L
}


which still fulfills another requirement: to give the same result as the R 3.4.1 nclass.FD() in cases where that is "ok".
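
Spot-checking this version on the original example (my run; note that the
'/ al' factor here is revisited in comment 8 below):

    nclass.FD(c(1, 1, 1, 1 + 1e-15, 2))
    ## [1] 1 -- signif() collapses the spurious gap; the (1/8, 7/8)-quantile
    ##          spread gives h = 0.5/(1/8) = 4, and ceiling(1/4 * 5^(1/3)) = 1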

--- 
However, in the meantime I found other examples which *still* give a huge nclass.FD() -- a case that cannot be reduced to rounding / zapsmall problems etc., but rather comes from an inherent problem of the FD rule: the histogram *bin width* `h` is very robustly estimated, but nclass ~= diff(range(x)) / h is of course very anti-robust. So, both with the old and the new nclass.FD(), you currently get

> nclass.FD(c(1:9, 1e300*c(-1,1)))
[1] 4.44796e+299

and I think this is a necessary consequence of FD's definition.

Currently, I think we should set an artificial upper bound such as 10^9 or a bit smaller ...
or else, after all this thinking and trying, conclude that nclass.FD() is inherently unbounded ... and maybe just document that.

I *will* change our definition of nclass.FD() and use something like the above in any case.  The earlier change (in 2007) which used mad() was really embarrassingly nonsensical.
Note that there's also a small change needed for nclass.scott() for the case of the above data, where it currently gives 0, and that *is* too small.
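
(My reading of why nclass.scott() gives 0 here: var() overflows to Inf for
these data, so Scott's h = 3.5*sqrt(var(x))*length(x)^(-1/3) is Inf and the
ratio underflows:)

    x <- c(1:9, 1e300*c(-1,1))
    stats::var(x)                    ## Inf: the squared deviations overflow double range
    ceiling(diff(range(x)) / Inf)    ## 0 -- the value nclass.scott() currently returns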
Comment 5 Martin Maechler 2017-07-26 13:25:08 UTC
I've thought further, and my plan is now:   

1) Use and document  nclass.FD()  as shown in comment 4.

2) In hist.default(), in the case where 'breaks' is a single number (indicating the *number* of breaks),
   use an upper bound of 1e9, and warn if breaks was larger; a minimal sketch follows below.
   [[A user can always construct her full 'breaks' vector and pass it to hist() to get a larger number of breaks and no warning.]]
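
A minimal sketch of what (2) could look like (my illustration only --
'clamp_breaks' is a hypothetical helper, not the committed code):

    clamp_breaks <- function(breaks, max.n = 1e9) {
        if (breaks > max.n) {
            warning(gettextf("'breaks = %g' is too large and set to %g",
                             breaks, max.n))
            breaks <- max.n
        }
        breaks
    }
    clamp_breaks(grDevices::nclass.FD(c(1:9, 1e300*c(-1,1))))  ## 1e9, with a warning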

I will commit this within a day or so, unless there are comments here.
Comment 6 Martin Maechler 2017-07-27 10:12:58 UTC
Fix committed as planned, to R-devel svn r72977   and R 3.4.1 patched, r72978
Comment 7 Martin Maechler 2017-08-03 14:55:55 UTC
(In reply to Martin Maechler from comment #5)
> I've thought further, and my plan is now:   
> 
> 1) Use and document  nclass.FD()  as shown in comment 4.
> 
> 2) In hist.default(), in the case 'breaks' is a single number (to indicate
> the *number* of breaks),
>    use an upper bound of 1e9, and warn if breaks was larger.
>    [[a user can always construct her full 'breaks' vector and pass it to
> hist() to get larger number of breaks and no warning]] 
> 
> I will commit this within a day or so, unless there are comments here.

As we only found out now, '1e9' is really too large [it only worked nicely on 64-bit, where in the example pretty(*, 1e9, min.n=1) produced only 20 points], and I've changed it to 1e6 now.

(People with large memory can always explicitly define `breaks <- seq(....)`  and use that directly).
Comment 8 Suharto Anggono 2017-08-07 15:26:12 UTC
(In reply to Martin Maechler from comment #4)
> nclass.FD <- function(x)
> {
>     h <- 2 * stats::IQR(x. <- signif(x, digits = 5))
>     if (h == 0) {
> 	x. <- sort(x.)
> 	al <- 1/4; al.min <- 1/512 # try quantiles 1/8, 1/16, ... 1/512
> 	while(h == 0 && (al <- al/2) >= al.min)
> 	    h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / al
>     }
>     if (h == 0) ## revert to Scott's:
> 	h <- 3.5 * sqrt(stats::var(x))
>     if (h > 0) ceiling(diff(range(x))/h * length(x)^(1/3)) else 1L
> }

Isn't
	    h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / (1-2*al)
more reasonable?
(The factor 2 in the quartile case is 2 = 1 / ((3/4) - (1/4)); generalizing to the quantiles at al and 1-al gives 1 / ((1-al) - al) = 1 / (1-2*al).)
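
(A quick sanity check of that factor, my sketch: for a large uniform sample,
dividing the (al, 1-al) quantile spread by (1-2*al) recovers the full spread
of ~1:)

    set.seed(1)
    x <- runif(1e6)
    al <- 1/8
    diff(stats::quantile(x, c(al, 1-al), names = FALSE)) / (1 - 2*al)   ## ~ 1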
Comment 9 Martin Maechler 2017-08-08 15:04:12 UTC
(In reply to Suharto Anggono from comment #8)
> (In reply to Martin Maechler from comment #4)
> > nclass.FD <- function(x)
> > {
> >     h <- 2 * stats::IQR(x. <- signif(x, digits = 5))
> >     if (h == 0) {
> > 	x. <- sort(x.)
> > 	al <- 1/4; al.min <- 1/512 # try quantiles 1/8, 1/16, ... 1/512
> > 	while(h == 0 && (al <- al/2) >= al.min)
> > 	    h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / al
> >     }
> >     if (h == 0) ## revert to Scott's:
> > 	h <- 3.5 * sqrt(stats::var(x))
> >     if (h > 0) ceiling(diff(range(x))/h * length(x)^(1/3)) else 1L
> > }
> 
> Isn't
> 	    h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / (1-2*al)
> more reasonable?
> 2 = 1 / ((3/4) - (1/4))
> 1 / (1-2*al) = 1 / ((1-al) - al)

Yes, thank you:   "1 / al"  is really bad. That was a thinko.

However, 1/(1 - 2*al) is not so much better... and it is also clearly biased... unless the true distribution of x is uniform!

We could use the asymptotically correct factor under a Gaussian distribution,
   1 / (-2*qnorm(al))    {since qnorm(1-al) - qnorm(al) = -2*qnorm(al)}
or, to be robust, use t_4 (e.g.) instead of the Gaussian.

Here's some code for experimentation:


##' Generalized IQR() - notably for when IQR = 0 :
gIQR <- function(x, alphas = 1/2^(2:9)) {
    stopifnot(0 <= alphas, alphas < 1/2)
    if(max(abs(log2(alphas) - round(log2(alphas)))) < 1e-7)
        names(alphas) <- sprintf("1/%d", 1/alphas)
    ## the raw (al, 1-al) quantile spreads:
    qd <- vapply(alphas, function(al)
        diff(stats::quantile(x, c(al, 1-al), names = FALSE)), numeric(1))
    ## correction factor to a Normal-consistent scale -- alternatives tried:
    ## qd / (- 2*qnorm(alphas))   # asymptotically exact under Gaussian:
    ##                            #   qnorm(1-a) - qnorm(a) == -2*qnorm(a)
    ## qd / (1 - 2*alphas)        # exact for the uniform (comment 8's factor)
    ## qd / alphas                # really wrong (first in nclass.FD())
    qd / (- 2*qt(alphas, df=4))  # or be more robust and take t_4 as "compromise"
}

set.seed(17)
rr <- replicate(64, gIQR(rnorm(100)))
t(apply(rr, 1, summary)) # when using qnorm() in gIQR():
##        Min. 1st Qu. Median  Mean 3rd Qu. Max.
## 1/4   0.724   0.897  0.959 0.973   1.038 1.27
## 1/8   0.711   0.897  0.944 0.956   1.021 1.15
## 1/16  0.752   0.906  0.971 0.967   1.019 1.19
## 1/32  0.736   0.902  0.936 0.951   0.989 1.17
## 1/64  0.730   0.873  0.925 0.932   0.980 1.13
## 1/128 0.755   0.851  0.901 0.909   0.957 1.16
## 1/256 0.737   0.814  0.858 0.880   0.946 1.13
## 1/512 0.702   0.766  0.815 0.836   0.908 1.11
## \
##  \-> systematically too small  <--> a biased quantile extrapolation happens for small n
boxplot(t(rr), notch=TRUE); abline(h = 1, lty=3)

set.seed(7)
## considerably larger n
rr <- replicate(256, gIQR(rnorm(20000)))
boxplot(t(rr), notch=TRUE); abline(h = 1, lty=3)
## now looks good apart from 1/512

## considerably larger n
##  heavier tailed than assumed:
rr <- replicate(256, gIQR(rt(20000, 3)))
boxplot(t(rr), notch=TRUE); abline(h = 1, lty=3)
## --> growing
##
## Now assuming correct tails:  t_4:
rt4 <- replicate(256, gIQR(rt(20000, df=4)))
boxplot(t(rt4), notch=TRUE); abline(h = 1, lty=3)

rU <- replicate(256, gIQR(runif(20000)))
boxplot(t(rU), notch=TRUE); abline(h = 1, lty=3)
rN <- replicate(256, gIQR(rnorm(20000)))
boxplot(t(rN), notch=TRUE); abline(h = 1, lty=3)

--------------

But after all that: we do want to generalize the 2*IQR(x) which is used in the usual case where IQR() is not zero. Under X ~ N(*, sigma^2), lim_{n -> oo} E[2*IQR(X_n)] = 2*1.349*sigma ~= 2.7*sigma
(and Scott's rule, the last resort we use, also takes 3.5 * sqrt(var(.))) ...
so maybe staying with 1/(1 - 2*al) is good enough?
Comment 10 Martin Maechler 2017-08-09 15:22:54 UTC
Given no further suggestions,
I've decided to keep it simple and use your proposed 1/(1 - 2*al).

Thank you, Suharto, once more!
Comment 11 Suharto Anggono 2017-08-09 16:27:46 UTC
(In reply to Martin Maechler from comment #4)
[snip]
> 
> Also you had proposed to use zapsmall() on the diff()  which I take as idea 
> but apply doing  signif(x, digits = 5)  [5 seems better than 7, e.g.]  at
> least for one set of examples.
> 
[snip]
> 
> This gives currently 
> 
[snip]
> 
> 
> which still fulfills another requirement: To give the same result as the R
> 3.4.1 nclass.FD()  in cases where that is "ok".

    x <- 1e5 + (1:9)
    x <- 1e8 + (1:9)
In those cases, R 3.4.1's 'nclass.FD' is "ok". If signif(x, digits = 5) is used, much precision is lost.
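
For instance (my check): with 5 significant digits, every element of
1e8 + (1:9) rounds to exactly 1e8, so the rounded data have IQR 0 even though
the data are perfectly regular:

    x <- 1e8 + (1:9)
    stats::IQR(x)               ## 4
    stats::IQR(signif(x, 5))    ## 0 -- all values collapse to 1e8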