The documentation of findInterval states:
"... the internal algorithm uses interval
search ensuring O(n * log(N)) complexity where ‘n <- length(x)’
(and ‘N <- length(vec)’). For (almost) sorted ‘x’, it will be
even faster, basically O(n)."
In fact the algorithm used to do the interval search is O(N + n log(N)), due to an initial check for sorted order and NAs. This was noted some time ago:
Note that even if one argues that "internal" in the sense of an internal call down to compiled code, the sentence is still incorrect because both anyNA and is.unsorted result in internal calls.
The ideal resolution to this (IMO) would be a parameter option to remove the checks. There are many HPC use-cases where sorted vectors are maintained and fast binary search is desired.
Perhaps a better solution would be to somehow track whether a vector is sorted. Then, there is no need to make an explicit assertion and code is more easily generalized.
Just as a point of clarity on the proposed resolution. Some checks could be made within the C findInterval call. One could throw an error if the next step in the algorithm yields a value that is NA or implied out of order data. This would catch most randomly ordered vectors, but would be less likely to throw an error on nearly-sorted with few NAs. This would not hurt the algorithmic complexity.
I'd be happy to see about making a patch if this is an acceptable direction.
O(N + N*log(N)) is asymptotically O(N*log(N)), so I'm not sure it's worth the added code complexity. I'm more amenable to the simple assertion of sortedness idea.
Martin Maechler pointed out that I lost the distinction between 'n' and 'N', so yes this matters when n >> N. We will discuss these ideas internally and get back to you.
The plan is for R to change here and "automagically" become fast for checks such as is.unsorted() or is.na(.) --- by memoizing such properties in the corresponding object.
Any movement on this? Perhaps in the mean time, the documentation could be changed so that it correctly states the computational complexity.