Bug 17394 - Easy low-risk 100x speedup for getOption() -- patch included
Summary: Easy low-risk 100x speedup for getOption() -- patch included
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist (show other bugs)
Version: R 3.4.3
Hardware: All All
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2018-03-17 08:22 UTC by Matt Dowle
Modified: 2018-09-14 09:57 UTC (History)
5 users (show)

See Also:


Attachments
6 different variants of getOption() and timing measurements plus visualization via boxplots (11.57 KB, text/plain)
2018-09-11 15:56 UTC, Martin Maechler
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Matt Dowle 2018-03-17 08:22:02 UTC
Please consider getOption :

> base::getOption
function (x, default = NULL) 
{
    if (missing(default) || x %in% names(options())) 
        .Internal(getOption(x))
    else default
}

Say you have a function that uses getOption and that function is quite small and quick, but is called a lot in a loop. Rprof() has shown that the speed culprit can be getOption. When the default= argument of getOption is used; i.e., most common usage; that's when it's slow, regardless of whether the option exists or not.

> options(MyTestOption=TRUE)

> system.time(for (i in 1:100000) getOption("NotExistsOption", TRUE))
   user  system elapsed 
  5.004   0.000   5.005
> system.time(for (i in 1:100000) getOption("NotExistsOption"))
   user  system elapsed 
  0.051   0.000   0.052
> system.time(for (i in 1:100000) getOption("MyTestOption", TRUE))
   user  system elapsed 
  4.997   0.000   4.997
> system.time(for (i in 1:100000) getOption("MyTestOption"))
   user  system elapsed 
  0.048   0.000   0.048


There is a 100x difference here: 5s vs 0.05s.  I believe it is due to %in%.

The proposal is to change it to this :

getOption <- function(x, default = NULL) {
    ans <- .Internal(getOption(x))
    if (is.null(ans)) default else ans
}

Since the .Internal(getOption()) is very fast at fetching the option's value if it exists and otherwise returning NULL.

If I run the 4 cases again, the two slow cases are fixed, and the two fast cases are still as fast. They are all at ~0.05s now.

> system.time(for (i in 1:100000) getOption("NotExistsOption", TRUE))
   user  system elapsed 
  0.066   0.000   0.065 
> system.time(for (i in 1:100000) getOption("NotExistsOption"))
   user  system elapsed 
  0.051   0.000   0.051 
> system.time(for (i in 1:100000) getOption("MyTestOption", TRUE))
   user  system elapsed 
  0.056   0.000   0.056 
> system.time(for (i in 1:100000) getOption("MyTestOption"))
   user  system elapsed 
  0.053   0.000   0.053 

It is all-win with no downside. Slightly simpler code too. I have checked getOption's manual page and I don't see any documented reason why %in% might be needed in the current implementation.

There is one reason I think could explain the way the function is currently written. Perhaps if the option does exist but has value NULL. If this is the case then getOption("MyOption", TRUE) should return NULL not TRUE. However this is not possible because an option cannot hold the value NULL.  Assigning NULL to an option removes that option, as follows :

> options(zzz=1L)
> tail(options(),2)
$width
[1] 200

$zzz
[1] 1

> options(zzz=NULL)
> tail(options(),2)
$warning.length
[1] 1000

$width
[1] 200

> 

If there is a reason for the %in%, I'm willing to work on and test further cases as you suggest.

Best, Matt

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.3 tools_3.4.3   
>
Comment 1 Peter Dalgaard 2018-03-17 17:40:23 UTC
AFAICT, this would break if default is non-NULL and an option has the value NULL.
Comment 2 Matt Dowle 2018-03-18 01:13:22 UTC
I already addressed this point in the paragraph starting "There is one reason". I included output after that paragraph demonstrating that assigning NULL to an option removes that option. Therefore, it is not possible for an option to hold the NULL value, afaik.
Comment 3 Henrik Bengtsson 2018-03-20 20:59:31 UTC
Funny, I was just going to submit an old patch addressing the problem where getOption() is slow when 'default' is specified. I found that that the culprit is 'options()' sorting the options by name each time.  A less intrusive but backward compatible patch is:

svn diff src/library/base/R/options.R
Index: src/library/base/R/options.R
===================================================================
--- src/library/base/R/options.R	(revision 74428)
+++ src/library/base/R/options.R	(working copy)
@@ -24,7 +24,7 @@
     ## To avoid always performing the %in%,
     ## we use the original code if default is not specified.
     ## if(missing(default)) return(options(x)[[1L]])
-    if(missing(default) || x %in% names(options()))
+    if(missing(default) || x %in% names(.Options))
 	.Internal(getOption(x))
     else
 	default


This is 20-30 times faster, e.g.

## With names(.Options):
> microbenchmark::microbenchmark(getOption("width", 80), unit = "us", times = 10e3)
Unit: microseconds
                   expr   min    lq     mean median    uq     max neval
 getOption("width", 80) 1.563 1.623 2.004667  1.663 1.771 2115.85 10000


## With names(options()):

> microbenchmark::microbenchmark(getOption("width", 80), unit = "us", times = 10e3)
Unit: microseconds
                   expr    min     lq     mean  median      uq      max neval
 getOption("width", 80) 63.872 64.758 68.15368 65.1425 66.1235 1919.728 10000



My investigation on this: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/52
Comment 4 Matt Dowle 2018-03-20 21:27:30 UTC
Henrik - can you explain to me please why my patch is "intrusive" and "backwards incompatible", as you put it? Did you see and read what I wrote above (and repeated) about NULL not being a valid option value? I do not understand. Please explain.
Comment 5 Henrik Bengtsson 2018-03-20 21:53:13 UTC
Hi Matt, sorry I didn't mean it in that way and I don't argue against your solution.

The "backward" compatible part was that my patch is a minimal backward compatible change by itself - independent of yours (it was from a sentence in a standalone bug submission I was about to hit submit on when I saw ours).

The "less intrusive" part was used in the sense that it would require less code to be traced and fewer things to be considered for acceptance, e.g. "what happens if R one day would support storing 'NULL' values?"

I guess the ideal implementation would to deal with 'default' internally, .Internal(getOption(x, default)).
Comment 6 Martin Maechler 2018-03-20 22:08:05 UTC
(In reply to Henrik Bengtsson from comment #3)
> Funny, I was just going to submit an old patch addressing the problem where
> getOption() is slow when 'default' is specified. I found that that the
> culprit is 'options()' sorting the options by name each time.  A less
> intrusive but backward compatible patch is:
> 
> svn diff src/library/base/R/options.R
> Index: src/library/base/R/options.R
> ===================================================================
> --- src/library/base/R/options.R	(revision 74428)
> +++ src/library/base/R/options.R	(working copy)
> @@ -24,7 +24,7 @@
>      ## To avoid always performing the %in%,
>      ## we use the original code if default is not specified.
>      ## if(missing(default)) return(options(x)[[1L]])
> -    if(missing(default) || x %in% names(options()))
> +    if(missing(default) || x %in% names(.Options))
>  	.Internal(getOption(x))
>      else
>  	default
> 
> 
> This is 20-30 times faster, e.g.
> 
> ....................

Indeed, this is very convincing and strictly back compatible.
I have found 3 (or 4) other such cases,  and replaced  options() by .Options in all of them.

I propose to close the issue even though  Matt is correct that in the current implementation options being NULL is equivalent for them to "not exist".
Note however, that from  ?options, there is at least one important option which is  NULL by default:  "error" ... and on help page its default 'NULL' value is even listed explictly among the "factory fresh" default options.

I do think I esthetically prefer  Matt's proposed getOption() to the current one, and I do agree with Math that currently his version should fully equivalent to the previous one. ... still I'm not sure if it worth the extra change,...
Comment 7 Peter Dalgaard 2018-03-20 22:30:28 UTC
However, Matt is right that it is not possible to have an option with NULL as the value (rather to my surprise). The default setting for "error" isn't really a value; try this in plain R (not RStudio):

> "error" %in% names(options())
[1] FALSE

Also, the usual "tricks" for creating NULL list elements don't seem to work, e.g.

> .Options["error"] <- list(NULL) 
> "error" %in% names(.Options)
[1] TRUE
> "error" %in% names(options())
[1] FALSE
Comment 8 Matt Dowle 2018-03-20 22:56:03 UTC
Peter - by any chance, before trying that, did you try options(error=NULL) ?  Because that removes the 'error' option. Subsequently, "error" %in% names(.Options) returns FALSE as you report.  In a fresh R session, I see TRUE.
Comment 9 Peter Dalgaard 2018-03-20 23:07:47 UTC
Matt: This was plain R in Terminal.app on a Mac. No previous diddling, doesn't seem to change with --vanilla option.

> "error" %in% names(.Options)
[1] FALSE
> "error" %in% names(options())
[1] FALSE

if you get something different, perhaps options("error") is non-NULL on your platform?
Comment 10 Matt Dowle 2018-03-20 23:16:35 UTC
Peter -- you're right. With R --vanilla, I now see the same as you. I have the line
  options(error=quote(dump.frames()))
in my ~/.Rprofile.
Comment 11 Martin Morgan 2018-03-20 23:19:00 UTC
Does it matter that "error" is documented as NULL? The current implementation doesn't respect this

> getOption("error", "shoes")
[1] "shoes"

I also do not have "error" %in% names(.Options), on 

> R.version
               _                                                 
platform       x86_64-pc-linux-gnu                               
arch           x86_64                                            
os             linux-gnu                                         
system         x86_64, linux-gnu                                 
status         Under development (unstable)                      
major          3                                                 
minor          5.0                                               
year           2018                                              
month          03                                                
day            20                                                
svn rev        74427                                             
language       R                                                 
version.string R Under development (unstable) (2018-03-20 r74427)
nickname       Unsuffered Consequences      

The current behavior was introduced by

$ svn diff -c67541 src/library/base/R/options.R 
Index: src/library/base/R/options.R
===================================================================
--- src/library/base/R/options.R	(revision 67540)
+++ src/library/base/R/options.R	(revision 67541)
@@ -23,7 +23,9 @@
 {
     ## To avoid always performing the %in%,
     ## we use the original code if default is not specified.
-    if(missing(default)) return(options(x)[[1L]])
-
-    if(x %in% names(options())) options(x)[[1L]] else default
+    ## if(missing(default)) return(options(x)[[1L]])
+    if(missing(default) || x %in% names(options()))
+	.Internal(getOption(x))
+    else
+	default
 }

where the new code now contradicts the comment at the top of the diff and can be viewed as a performance regression that Matt wishes to fix.
Comment 12 Henrik Bengtsson 2018-03-20 23:43:44 UTC
FWIW, from ?options: "[...] object .Options whose value is a pairlist containing the current options() (in no particular order). Assigning to it will make a local copy and not change the original.", i.e.

> .Options["error"] <- list(NULL)
> "error" %in% names(.Options)
[1] TRUE
> "error" %in% names(base::.Options)
[1] FALSE

So, the above is not a backdoor to inject a 'NULL' value into the R options.
Comment 13 Martin Maechler 2018-03-26 16:09:28 UTC
Before this thread falls asleep, a kind of summary:

1.  The majority of commmenters tends to agree that the change  Matt Dowle proposed originally is still an improvement [esthetically / speed wise] even after applying Henrik's proposed replacement  of  names(options()) by names(.Options) :

getOption <- function(x, default = NULL) {
    ans <- .Internal(getOption(x))
    if (is.null(ans)) default else ans
}


2.  There is the issue that no part of the code can really distinguish between an option that is set to NULL [which a user cannot do anyway currently]  and a non-existing option.
    That maybe considered a limitation of the current setup -- or not.

   I think if we ever get proposals to change that, they need to be a new thread altogether.
Comment 14 Tomas Kalibera 2018-03-27 13:40:16 UTC
A "default" argument like this is sometimes introduced exactly to support NULL as valid value. One can then define a token, e.g. "missingToken <- new.env()", ensure it is not passed anywhere else, and use it for the "default" argument. One example where this is important is returnValue(). To support this in getOption, we would have to modify the underlying C code (multiple functions).

From reading the C code behind getOption, I think that it was not the intention here. I think it was just to make it easier for the user to use a (non-null) default value in cases an option was not set. So I think we should be able to implement getOption as Matt suggests. Also there seems to be an error in the documentation

> For ‘getOption’, the current value set for option ‘x’, or ‘NULL’
>     if the option is unset.

probably this should say "or 'default' if the option is unset". Indeed, checking against all packages might be useful.

If the performance of getting options was a really critical issue, we might also rewrite to a hash table (use a hashed environment instead of a pairlist).
Comment 15 Martin Maechler 2018-03-28 09:06:25 UTC
After changing to  names(.Options),   the speedup is only about 4-5 x,  but still,  and as I said the new code is "nicer" in my view.

Note that I've also updated the documentation for problem Tomas mentioned.

With thanks to all discussants!
Comment 16 Matt Dowle 2018-03-28 18:09:18 UTC
Many thanks, Martin and all.
Very best,
Matt
Comment 17 Martin Maechler 2018-03-31 20:10:07 UTC
We were all wrong.
"We" actually do know better, but then we forget about it:  missing()  in R is more subtle than we think because in R  "missingness can be passed down" ...
and  "of course"   package authors have been relying on this feature :

The change is *NOT* back compatible, and actually does break one of Henrik's package checks (R.Matlab),  and more importantly breaks pkg parallelMap 's option settings, e.g.,

parallelMap::getParallelOptions()


I will revert the commit and add the following regression test:

## getOption(op, def) -- where 'def' is missing (passed down):
getO <- function(op, def) getOption(op, def)
stopifnot(is.null(getO("foobar")))
## failed for a few days in R-devel, ...
Comment 18 Henrik Bengtsson 2018-04-01 18:17:54 UTC
Thanks. For the record/to further clarify Martin's findings: the error when no default is specified occurred for *non-existing* options only, e.g.

> getO <- function(op, def) getOption(op, def)

## Existing option
> getO("digits", 2L)
[1] 7
> getO("digits")
[1] 7

## Non-existing option
> getO("non-existing")
Error in getOption(op, def) : argument "def" is missing, with no default
> getO("non-existing", 2L)
[1] 2


I believe it's pretty straightforward to workaround this in both R.matlab and parallelMap, e.g.

> getO <- function(op, def = NULL) getOption(op, def)

However, it's not unlikely there's more non-CRAN code out there that also break.  And, I agree that it's expected that missing:ness can be passed down also here.
Comment 19 Suharto Anggono 2018-04-13 21:31:52 UTC
The following version should be back-compatible.

getOption <- function(x, default = NULL) {
    ans <- .Internal(getOption(x))
    if (missing(default)) ans else if (is.null(ans)) default else ans
}
Comment 20 Matt Dowle 2018-09-03 21:44:20 UTC
I suspect Suharto's solution might have been missed because this is a closed issue. Could it be considered please.
Comment 21 Martin Maechler 2018-09-11 15:56:08 UTC
Created attachment 2368 [details]
6 different variants of getOption()  and timing measurements plus visualization via boxplots
Comment 22 Martin Maechler 2018-09-11 15:58:04 UTC
(In reply to Matt Dowle from comment #20)
    > I suspect Suharto's solution might have been missed because this is a closed
    > issue. Could it be considered please.

I  had done some timing measurements and at first found that his solution slightly deteriorated the performance for the very important  1-argument use case of getOption().  But those measurements relied on JIT compilation instead of using the current byte compiler (in R-devel!)..  We had some R core talk and Tomas Kalibera proposed a version of Suharto's which is less elegant but about 10% faster for the 1-argument case (which is used by 95--99% of all getOption() calls in the R sources!) because it avoids creating an intermediate R object:

    ## Tomas' version  (2018-09-10) from Suharto's:
    getOption_SAtk <- function(x, default = NULL) {
         if (missing(default))
             .Internal(getOption(x))
         else {
             ans <- .Internal(getOption(x))
             if (is.null(ans)) default else ans
         }
    }


Last evening I've started more serious timing measurements properly using byte compilation etc, and that experiment convinced me to indeed reopen the case,
because Suharto's version seems clearly better than the current R-devel code, and
Tomas' version of that buys even a bit more.

I've attached a nice R script (which contains a both the timing simulations and some visualization code) and currently propose to change (in R-devel) to the above version "by Suharto Anggono and Tomas Kalibera".

But you are welcome to experiment yourself and extend my attached script.
Comment 23 Matt Dowle 2018-09-11 21:51:49 UTC
This is great Martin. Thanks for all your time here. This proposal indeed looks ideal.

Since I originally raised this issue and suggested the 100x improvement by avoiding the expensive %in% with the code :
  getOption <- function(x, default = NULL) {
    ans <- .Internal(getOption(x))
    if (is.null(ans)) default else ans
  }
and the final solution is extremely similar, could I be included in the credit as well as the other people? I don't see anything in NEWS about getOption for the last change nor were there any credits in the previous commit messages. I wasn't thinking about credit and I didn't do it for credit, but now you've raised the question of credit, you know if other people get all the credit for a 100x speedup that I proposed and is very similar, I'd be a bit annoyed!
Comment 24 Martin Maechler 2018-09-14 09:57:02 UTC
This has been committed to R-devel yesterday.  There are too many contributors to this to list all in NEWS... I'm mentioning them in a comment in the source now.

Note Matt that your original claim was not correct:
- your patch was wrong
- the important speed improvement was for the very rarely used 2-argument call case
- it was not `%in%` that lead to unnecessary overhead.

But of course, I think we are all glad and grateful you brought up the topic and we iterated to a nice solution in the end!