Bug 16567 - Claims about computational complexity of findInterval
Summary: Claims about computational complexity of findInterval
Status: ASSIGNED
Alias: None
Product: R
Classification: Unclassified
Component: Documentation
Version: R 3.2.2
Hardware: Other Other
Importance: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-10-15 22:24 UTC by Ian Fellows
Modified: 2017-02-16 23:10 UTC (History)
2 users

See Also:


Description Ian Fellows 2015-10-15 22:24:30 UTC
The documentation of findInterval states:

"...  the internal algorithm uses interval
     search ensuring O(n * log(N)) complexity where ‘n <- length(x)’
     (and ‘N <- length(vec)’).  For (almost) sorted ‘x’, it will be
     even faster, basically O(n)."

In fact, the algorithm used for the interval search is O(N + n log(N)), owing to an initial O(N) pass that checks for sorted order and NAs. This was noted some time ago:

https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html

Note that even if one argues that "internal" means only the call down to compiled code, the sentence is still incorrect, because both anyNA and is.unsorted also result in internal calls.


------------------------------------

The ideal resolution to this (IMO) would be a parameter option to skip the checks. There are many HPC use cases where sorted vectors are maintained and a fast binary search is desired.
Comment 1 Michael Lawrence 2015-10-15 22:33:07 UTC
Perhaps a better solution would be to somehow track whether a vector is sorted. Then there would be no need for an explicit assertion, and the code would generalize more easily.
Comment 2 Ian Fellows 2015-10-15 22:50:32 UTC
Just as a point of clarity on the proposed resolution: some checks could be made within the C findInterval call. One could throw an error if the next step in the algorithm yields a value that is NA or implies out-of-order data. This would catch most randomly ordered vectors, but would be less likely to throw an error on nearly-sorted data with few NAs. It would not hurt the algorithmic complexity.

I'd be happy to see about making a patch if this is an acceptable direction.
Comment 3 Michael Lawrence 2015-10-15 23:02:22 UTC
O(N + N*log(N)) is asymptotically O(N*log(N)), so I'm not sure it's worth the added code complexity. I'm more amenable to the simple assertion of sortedness idea.
Comment 4 Michael Lawrence 2015-10-16 11:02:00 UTC
Martin Maechler pointed out that I lost the distinction between 'n' and 'N', so yes, this matters when N >> n. We will discuss these ideas internally and get back to you.
Comment 5 Martin Maechler 2015-11-02 11:02:46 UTC
The plan is for R to change here and "automagically" become fast for checks such as is.unsorted() or is.na(.) --- by memoizing such properties in the corresponding object.
Comment 6 Ian Fellows 2017-02-16 23:10:32 UTC
Any movement on this? Perhaps, in the meantime, the documentation could be changed so that it correctly states the computational complexity.