Summary
-------

Floating point errors can cause a data vector to have an ultra-small inter-quartile range, which causes `grDevices::nclass.FD` to suggest an absurdly large number of breaks to `graphics::hist(breaks="FD")`. Because this large float becomes NA when converted to integer, hist's call to `base::pretty` crashes. How could nclass.FD, whose job is to suggest a reasonable number of classes, avoid suggesting an absurdly large number of classes when the inter-quartile range is absurdly small compared to the range?

Steps to reproduce
------------------

    hist(c(1, 1, 1, 1 + 1e-15, 2), breaks="FD")

Observed behaviour
------------------

Running this code gives the following error message:

    Error in pretty.default(range(x), n = breaks, min.n = 1): invalid 'n' argument
    In addition: Warning message:
    In pretty.default(range(x), n = breaks, min.n = 1) :
      NAs introduced by coercion to integer range

Expected behaviour
------------------

That hist() should never crash when given valid numerical data. Specifically, that it should be robust even to those rare datasets where (through floating point inaccuracy) the inter-quartile range is tens of orders of magnitude smaller than the range.

Analysis
--------

Dramatis personae:

* graphics::hist.default   https://svn.r-project.org/R/trunk/src/library/graphics/R/hist.R
* grDevices::nclass.FD     https://svn.r-project.org/R/trunk/src/library/grDevices/R/calc.R
* base::pretty.default     https://svn.r-project.org/R/trunk/src/library/base/R/pretty.R

`nclass.FD` examines the inter-quartile range of `x` and gets a positive but very small floating point value -- let's call it TINYFLOAT. It inserts this ultra-low IQR into the `nclass` denominator, which means `nclass` becomes a huge number -- let's call it BIGFLOAT. `nclass.FD` then returns this huge value to `hist`.
Once `hist` has its 'number of breaks' suggestion, it feeds this number to `pretty`:

    pretty(range(x), BIGFLOAT, min.n = 1)

`pretty`, in turn, calls

    .Internal(pretty(min(x), max(x), BIGFLOAT, min.n, shrink.sml,
                     c(high.u.bias, u5.bias), eps.correct))

which fails with the error and warning shown at the start of this e-mail (invalid 'n' argument / NAs introduced by coercion to integer range). My reading is that .Internal tried to coerce BIGFLOAT to integer range and produced an NA, and that (the C implementation of) `pretty`, in turn, choked when confronted with the NA.
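To make TINYFLOAT and BIGFLOAT concrete, here is a small sketch (mine, with illustrative variable names) tracing the same quantities by hand, using the Freedman-Diaconis bin width h = 2*IQR(x)/n^(1/3):

```r
## Tracing the failure by hand (variable names are illustrative):
x <- c(1, 1, 1, 1 + 1e-15, 2)

iqr <- stats::IQR(x)               # TINYFLOAT: positive, but ~1e-15
h   <- 2 * iqr / length(x)^(1/3)   # Freedman-Diaconis bin width, also tiny
ncl <- ceiling(diff(range(x)) / h) # BIGFLOAT: ~7.7e14

iqr > 0                            # TRUE
ncl > .Machine$integer.max         # TRUE -- far beyond integer range
suppressWarnings(as.integer(ncl))  # NA, which is what trips up pretty()
```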

I can confirm the issue. For future reference, this isn't a "crash", just a failure with an error message, but I agree it shouldn't happen.

(In reply to Duncan Murdoch from comment #1)
> I can confirm the issue. For future reference, this isn't a "crash", just a
> failure with an error message, but I agree it shouldn't happen.

I agree. However, arguably this is not just a problem of "Accuracy". If you look at this case (and its obvious generalization into the one-dimensional manifold of problems obtained by replacing 10^-15 with a parameter epsilon) and the nclass.FD() formula(s), you may start to see that it is rather a problem of how to compute robust scale estimates in such cases, notably if you need a robust scale estimate which should be positive.

Interestingly, FD (= Freedman-Diaconis) do consider scale 0, but then use the MAD() instead of the IQR() ... but that latter "design" seems a bit reversed here: the MAD() is more robust than the IQR() and hence more typically 0 than the IQR in such situations (of "inlier"s (<-> "outlier"s)).

I think we should fix along the following ideas:

The 'h' used in nclass.FD() should "underflow to 0" in this case, i.e., a reasonable _robust_ scale estimate should be 0, at least in the case of epsilon=1e-15, and so nclass.FD() would return 1.

There's more to this: FD (Freedman-Diaconis) really only mentioned the IQR, not the extra clause using the MAD:
https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule
clearly states the FD rule as choosing

    h = 2 IQR(x) / length(x)^{1/3}

(you can get to the PDF of the original Springer paper). The extra step that happens in nclass.FD(), choosing MAD instead of IQR if the latter is zero, seems to be a later invention ... and a pretty 'strange' one: I cannot easily find an 'x' for which IQR(x) is zero but MAD(x) is not; i.e., the extra if(.) does not seem to help at all (can anyone else find a case for which the quartiles are equal, but the MAD is not zero? It can only happen in numeric/boundary cases: in such a case, the middle half of the sample are all identical observations, identical to the median ==> half of the "deviations from the median" are already zero).

Interestingly enough, the guy that introduced the mad(.) clause

    r40860 | maechler | 2007-03-21 11:56:55

must have had different thoughts back then (;-)

After all these musings, I still think we should treat this similarly to how smooth.spline() dealt with 'x' values that were very close, but not quite equal: use 'signif(x)' instead of 'x' in the crucial case.

Since this also changes the result --- **very** rarely --- for "regular" cases, I'd propose to also replace the previous

    if(h == 0) h <- mad(x, constant = 2)

by something more reasonable.

Martin
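The "IQR zero implies MAD zero" argument above can also be probed empirically; this brute-force search over small integer samples (my own sketch, a finite search rather than a proof) finds no counterexample:

```r
## Brute-force search for a counterexample to "IQR(x) == 0 but mad(x) > 0":
set.seed(1)
counterexample <- FALSE
for (i in 1:20000) {
    x <- sample(0:3, 9, replace = TRUE)   # many ties, so IQR is often 0
    if (stats::IQR(x) == 0 && stats::mad(x) > 0) {
        counterexample <- TRUE
        break
    }
}
counterexample  # FALSE: the mad() fallback never fires when IQR() is 0
```

This matches the reasoning above: if the quartiles coincide, at least half the sample equals the median, so the median absolute deviation is 0 as well.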

(In reply to Duncan Murdoch from comment #1)
> I can confirm the issue. For future reference, this isn't a "crash", just a
> failure with an error message, but I agree it shouldn't happen.

Good point, noted.

(In reply to Martin Maechler from comment #2)
> I think we should fix along the following ideas:
>
> The 'h' used in nclass.FD() should "underflow to 0" in this case, i.e., a
> reasonable _robust_ scale estimate should be 0, at least in the case of
> epsilon=1e-15, and so nclass.FD() would return 1.
[...]
> After all these musings, I still think we should treat this similar to how
> the smooth.spline() had dealt with 'x' values that were very close, but not
> quite equal: Use 'signif(x)' instead of 'x' in the crucial case.

One option might be `robust_h <- zapsmall(c(h, diff(range(x))))[1]`? By zapsmall's behaviour, `robust_h` will be 0 if `h` is at least `getOption('digits')` orders of magnitude smaller than `range(x)`. By default, `digits` is 7, so that's 7 orders of magnitude. This means the smallest IQR that doesn't get zapped would still lead to quite a few histogram breaks:

    data <- c(1, 1, 1, 1 + 1e-6, 2)
    nclass.FD(data)  ## == 854988
    hist(data)       ## does not crash

Although 800k is still far more breaks than makes sense on a computer screen, it is well below MAXINT, and so the problem is averted.

As for the question 'does the solution catch all errors?', I don't know enough about floating point numbers. Can one safely assume that floating point imprecision usually does not get within 6 orders of magnitude of the data? (Well, zapsmall implicitly assumes that, and it seems to work.)

> Since this also changes the result --- **very** rarely --- for "regular" cases,
> I'd propose to also replace the previous
>     if(h == 0) h <- mad(x, constant = 2)
> by something more reasonable.

Since Freedman-Diaconis is undefined for IQRs of (nearly) 0, would it make sense to simply fall back to nclass.Sturges in those cases?
I agree that the median absolute deviation isn't much help. The wiki page on FD says Sturges works well for data where n < 200; I expect pathological IQRs, too, will be more common in very small datasets.

Combining these thoughts (using zapsmall to underflow the h to 0; throwing away MAD; and falling back to Sturges if h is 0) would together result in something like this, modulo nitpicks and coding style. (Especially, h and safe_h could be combined into a one-liner; but for now writing them out is clearer.) Thoughts?

    nclass.FD <- function (x)
    {
        h <- stats::IQR(x)
        safe_h <- zapsmall(c(h, diff(range(x))))[1]
        if (safe_h == 0)
            nclass.Sturges(x)
        else
            ceiling(diff(range(x))/(2 * h * length(x)^(-1/3)))
    }
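A quick self-contained check of the draft above (my own verification, restated with a hypothetical name so it runs stand-alone): zapsmall() keeps a 1e-6 IQR but zaps a 1e-15 one, and the zapped case falls through to Sturges, which for n = 5 gives ceiling(log2(5) + 1) = 4 classes.

```r
## The draft from this comment, restated so it can be run stand-alone:
nclass.FD.draft <- function(x) {
    h <- stats::IQR(x)
    safe_h <- zapsmall(c(h, diff(range(x))))[1]  # 0 iff h is ~7+ orders below the range
    if (safe_h == 0)
        grDevices::nclass.Sturges(x)             # fallback for (near-)zero IQR
    else
        ceiling(diff(range(x)) / (2 * h * length(x)^(-1/3)))
}

## zapsmall's threshold in action (default getOption("digits") is 7):
zapsmall(c(1e-15, 1))[1]   # 0    : zapped, 15 orders of magnitude below the range
zapsmall(c(1e-06, 1))[1]   # 1e-06: survives, only 6 orders below

nclass.FD.draft(c(1, 1, 1, 1 + 1e-15, 2))  # 4, via the Sturges fallback
nclass.FD.draft(c(1, 1, 1, 1 + 1e-06, 2))  # 854988, via the FD branch
```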

I'm following your idea of reverting to another method when reasonable ways of trying to improve nclass.FD() "fail" -- in the sense of being based on 'h = 0'. But I'd strongly prefer Scott to Sturges in that case.

Also, you had proposed to use zapsmall() on the diff(), which I take up as an idea but implement by doing signif(x, digits = 5) [5 seems better than 7, e.g.], at least for one set of examples.

Also, I'm trying quite a bit harder to get a "robust" scale, by going from quartiles to octiles, "sixteen-tiles", etc., until I give up. This gives currently

    nclass.FD <- function(x)
    {
        h <- 2 * stats::IQR(x. <- signif(x, digits = 5))
        if (h == 0) {
            x. <- sort(x.)
            al <- 1/4; al.min <- 1/512  # try quantiles 1/8, 1/16, ... 1/512
            while(h == 0 && (al <- al/2) >= al.min)
                h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / al
        }
        if (h == 0) ## revert to Scott's:
            h <- 3.5 * sqrt(stats::var(x))
        if (h > 0) ceiling(diff(range(x))/h * length(x)^(1/3)) else 1L
    }

which still fulfills another requirement: to give the same result as the R 3.4.1 nclass.FD() in cases where that is "ok".

---

However, in the meantime I found other examples which *still* give a huge nclass.FD() .. a case you cannot reduce to rounding / zapsmall problems, etc., but rather to an inherent problem of the FD rule: the histogram *bin width* `h` is very robustly estimated, but nclass ~= diff(range(x)) / h is of course very anti-robust, and so, both with the old and the new nclass.FD(), you currently get

    > nclass.FD(c(1:9, 1e300*c(-1,1)))
    [1] 4.44796e+299

and I think this is a necessary consequence of FD's definition. Currently, I would think we should set an artificial upper bound such as 10^9 or a bit smaller ... or then, after all this thinking and trying, conclude that nclass.FD() inherently is unbounded .. and maybe just document that.

I *will* change our definition of nclass.FD() and use something like the above in any case. The earlier change (in 2007) which used mad() was really embarrassingly non-sensical.
Note that there's also a small change for nclass.scott() for the case of the above data, where it currently gives 0, and that *is* too small.

I've thought further, and my plan is now:

1) Use and document nclass.FD() as shown in comment 4.

2) In hist.default(), in the case where 'breaks' is a single number (to indicate the *number* of breaks), use an upper bound of 1e9, and warn if breaks was larger. [[a user can always construct her full 'breaks' vector and pass it to hist() to get a larger number of breaks and no warning]]

I will commit this within a day or so, unless there are comments here.

Fix committed as planned, to R-devel svn r72977 and R 3.4.1 patched, r72978

(In reply to Martin Maechler from comment #5)
> I've thought further, and my plan is now:
>
> 1) Use and document nclass.FD() as shown in comment 4.
>
> 2) In hist.default(), in the case 'breaks' is a single number (to indicate
>    the *number* of breaks), use an upper bound of 1e9, and warn if breaks
>    was larger.
>    [[a user can always construct her full 'breaks' vector and pass it to
>    hist() to get larger number of breaks and no warning]]
>
> I will commit this within a day or so, unless there are comments here.

As we only found out now, '1e9' is really too large [it only worked nicely on 64-bit, where in the example pretty(*, 1e9, min.n=1) produced only 20 points], and I've changed it to 1e6 now. (People with large memory can always explicitly define `breaks <- seq(....)` and use that directly.)
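The guard described in this comment can be sketched roughly as follows. This is hypothetical stand-alone code with names of my choosing, not the actual hist.default() source; only the 1e6 bound comes from the comment above.

```r
## Rough sketch: clamp a single-number 'breaks' before it reaches pretty():
clamp_breaks <- function(breaks, max_breaks = 1e6) {
    if (is.numeric(breaks) && length(breaks) == 1L && breaks > max_breaks) {
        warning(gettextf("'breaks = %g' is too large; set to %g",
                         breaks, max_breaks))
        breaks <- max_breaks
    }
    breaks
}

suppressWarnings(clamp_breaks(4.4e299))  # 1e+06, with a warning
clamp_breaks(30)                         # 30: ordinary values pass through
```

A user who really wants more cells can still pass an explicit breaks vector, which this single-number path never touches.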

(In reply to Martin Maechler from comment #4)
> nclass.FD <- function(x)
> {
>     h <- 2 * stats::IQR(x. <- signif(x, digits = 5))
>     if (h == 0) {
>         x. <- sort(x.)
>         al <- 1/4; al.min <- 1/512  # try quantiles 1/8, 1/16, ... 1/512
>         while(h == 0 && (al <- al/2) >= al.min)
>             h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / al
>     }
>     if (h == 0) ## revert to Scott's:
>         h <- 3.5 * sqrt(stats::var(x))
>     if (h > 0) ceiling(diff(range(x))/h * length(x)^(1/3)) else 1L
> }

Isn't

    h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / (1-2*al)

more reasonable?

    2 = 1 / ((3/4) - (1/4))
    1 / (1-2*al) = 1 / ((1-al) - al)
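A quick numeric restatement of the two identities above (my own check, nothing more): at al = 1/4 the proposed factor reduces to the familiar 2 used with the plain IQR, and at al = 1/8 it gives 4/3.

```r
## 1/(1 - 2*al) generalizes the factor 2 used with the plain IQR:
al <- 1/4
1 / (1 - 2*al)   # 2 == 1 / ((3/4) - (1/4))

al <- 1/8
1 / (1 - 2*al)   # 4/3: the (1/8, 7/8) spread covers 3/4 of the probability mass
```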

(In reply to Suharto Anggono from comment #8)
> (In reply to Martin Maechler from comment #4)
> > nclass.FD <- function(x)
> > {
> >     h <- 2 * stats::IQR(x. <- signif(x, digits = 5))
> >     if (h == 0) {
> >         x. <- sort(x.)
> >         al <- 1/4; al.min <- 1/512  # try quantiles 1/8, 1/16, ... 1/512
> >         while(h == 0 && (al <- al/2) >= al.min)
> >             h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / al
> >     }
> >     if (h == 0) ## revert to Scott's:
> >         h <- 3.5 * sqrt(stats::var(x))
> >     if (h > 0) ceiling(diff(range(x))/h * length(x)^(1/3)) else 1L
> > }
>
> Isn't
>     h <- diff(stats::quantile(x., c(al, 1-al), names = FALSE)) / (1-2*al)
> more reasonable?
> 2 = 1 / ((3/4) - (1/4))
> 1 / (1-2*al) = 1 / ((1-al) - al)

Yes, thank you: "1 / al" is really bad. That was a thinko.

However, 1/(1 - 2*al) is not so much better ... and also clearly biased ... unless the true distribution of x is uniform! We could use the asymptotically correct factor under a Gaussian distribution,

    1 / (-2*qnorm(al))      { qnorm(1-a) - qnorm(a) = -2*qnorm(a) }

or, to be robust, use t_4 (e.g.) instead of the Gaussian.
Here's some code for experimentation:

    ##' Generalized IQR() - notably for when IQR = 0 :
    gIQR <- function(x, alphas = 1/2^(2:9)) {
        stopifnot(0 <= alphas, alphas < 1/2)
        if(max(abs(log2(alphas) - round(log2(alphas)))) < 1e-7)
            names(alphas) <- sprintf("1/%d", 1/alphas)
        vapply(alphas, function(al)
               diff(stats::quantile(x, c(al, 1-al), names = FALSE)),
               numeric(1)) /
            ## asymptotical correction factor, to Normal-consistent scale,
            ## qnorm(1-a) - qnorm(a) == -2*qnorm(a) :
            ##    (- 2*qnorm(alphas))
            ## or be more robust and take t_4 as "compromise":
            (- 2*qt(alphas, df=4))
            ## ... or uniform-consistent:  (1 - 2*alphas)
            ## really wrong (first in nclass.FD()):  (alphas)
    }

    set.seed(17)
    rr <- replicate(64, gIQR(rnorm(100)))
    t(apply(rr, 1, summary)) # when using qnorm() in gIQR():
    ##        Min. 1st Qu. Median  Mean 3rd Qu.  Max.
    ## 1/4   0.724   0.897  0.959 0.973   1.038  1.27
    ## 1/8   0.711   0.897  0.944 0.956   1.021  1.15
    ## 1/16  0.752   0.906  0.971 0.967   1.019  1.19
    ## 1/32  0.736   0.902  0.936 0.951   0.989  1.17
    ## 1/64  0.730   0.873  0.925 0.932   0.980  1.13
    ## 1/128 0.755   0.851  0.901 0.909   0.957  1.16
    ## 1/256 0.737   0.814  0.858 0.880   0.946  1.13
    ## 1/512 0.702   0.766  0.815 0.836   0.908  1.11
    ##   \-> systematically too small <--> a biased quantile
    ##       extrapolation happens for small n
    boxplot(t(rr), notch=TRUE); abline(h = 1, lty=3)

    set.seed(7)
    ## considerably larger n:
    rr <- replicate(256, gIQR(rnorm(20000)))
    boxplot(t(rr), notch=TRUE); abline(h = 1, lty=3) ## now looks good apart from 1/512

    ## considerably larger n; heavier tailed than assumed:
    rr <- replicate(256, gIQR(rt(20000, 3)))
    boxplot(t(rr), notch=TRUE); abline(h = 1, lty=3) ## --> growing

    ## Now assuming correct tails: t_4:
    rt4 <- replicate(256, gIQR(rt(20000, df=4)))
    boxplot(t(rt4), notch=TRUE); abline(h = 1, lty=3)

    rU <- replicate(256, gIQR(runif(20000)))
    boxplot(t(rU), notch=TRUE); abline(h = 1, lty=3)

    rN <- replicate(256, gIQR(rnorm(20000)))
    boxplot(t(rN), notch=TRUE); abline(h = 1, lty=3)

--------------

But after all that: we do want to generalize the 2*IQR(x) which is used in the usual case where IQR() is not zero. Under X ~ N(*, sigma^2),

    lim_{n -> oo} E[2*IQR(X_n)] = 2 * 1.349 * sigma ~= 2.7 sigma

(and Scott's rule, the last resort we use, also takes 3.5 * sqrt(var(.))) ... so maybe staying with 1/(1 - 2*al) is good enough?
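For concreteness, the Gaussian correction factor discussed in this comment can be checked numerically (a quick aside of my own, not from the thread): at al = 1/4 the divisor -2*qnorm(al) recovers the familiar IQR constant of a standard normal, ~1.349, and the heavier-tailed t_4 alternative gives a somewhat larger divisor.

```r
## -2*qnorm(al) is the (al, 1-al) quantile spread of a standard normal:
al <- 1/4
-2 * qnorm(al)                 # ~1.349, the IQR of N(0,1)
qnorm(1 - al) - qnorm(al)      # the same number, by symmetry of the normal

## the heavier-tailed t_4 "compromise" divides by a larger constant:
-2 * qt(al, df = 4)            # ~1.48, larger than the Gaussian 1.349
```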

Given no further suggestions, I've decided to keep it simple and use your proposed 1/(1 - 2*al). Thank you, Suharto, once more!

(In reply to Martin Maechler from comment #4)
[snip]
> Also you had proposed to use zapsmall() on the diff() which I take as idea
> but apply doing signif(x, digits = 5) [5 seems better than 7, e.g.] at
> least for one set of examples.
[snip]
> This gives currently
[snip]
> which still fulfills another requirement: To give the same result as the R
> 3.4.1 nclass.FD() in cases where that is "ok".

    x <- 1e5 + (1:9)
    x <- 1e8 + (1:9)

In those cases, R 3.4.1 'nclass.FD' is "ok". If signif(x, digits = 5) is used, much precision is lost.