Bug 16490

Summary: speedup for library
Product: R Reporter: Peter Haverty <phaverty>
Component: Low-levelAssignee: R-core <R-core>
Status: ASSIGNED ---    
Severity: enhancement CC: maechler, michafla
Priority: P5    
Version: R-devel (trunk)   
Hardware: Other   
OS: Other   
Attachments: patch for library, loadNamespace, and attach
additional fixes for package loading
updated, more complete patch
updated patch
proper replacement for "objects" calls
simpler possibleExtends
base:::.mergeImportMethods patch
possible extends and .identC
simple namespaces speedups
new and improved possibleExtends
a few small changes
simplification for methods::.getGeneric
replacement for simplification in .getGeneric
simplifications for 'getClassDef' and 'is'
using substr(x,1L,...) rather than grep("^...",x)
much simpler and faster 'is'
simplifications for getting classes and generics
replacement for simple namespaces speedups
yet newer simple namespace speedups

Description Peter Haverty 2015-07-28 00:45:52 UTC
Created attachment 1859 [details]
patch for library, loadNamespace, and attach

This patch is two s/objects/names/ and an s/lapply(get)/mget/. It is good for a ~5% reduction in package loading times.
Comment 1 Peter Haverty 2015-07-29 20:19:30 UTC
Created attachment 1863 [details]
additional fixes for package loading

Many substitutions of assign, get, methods::el and methods::elNamed to [[

The get and get0 changes are all in cases where we just tested for exists, have all the names of items in the environment, or already testing for NULL anyway.

The match in methods::possibleExtends that is a key bottleneck is now which( scalar == vector ), which is much faster.

This patch also replaces methods::el entirely with [[. I almost did the same for methods::elNamed, but there is one elNamed call that seems to be somehow different than [[.
Comment 2 Peter Haverty 2015-07-30 18:48:07 UTC
Created attachment 1864 [details]
updated, more complete patch

20% speedup for package loading, mostly replacing assign, get, methods::elNamed and methods::el with their .Primitive counterparts. There are a few other tricks in there for specific bottlenecks, such as possibleExtends.
Comment 3 Martin Maechler 2015-07-31 13:46:20 UTC
Thank you Peter.

The diffs look good - if viewed "from far" -
with some exceptions:

1) Completely removing  `el`  and `el<-`  from methods is somewhat drastic and does not belong to the speedup. ... even though you find that we don't need them anymore.

Looking closer (via 'R make check-all') I'm finding more problems :

2) I'm pretty sure, one "}" has been misplaced in .possibleExtends in methods/R/RClassUtils.R ... or actually even more is not quite correct in that change
 (When byte-compiling or other bootstrapping 'methods', this gave an error).
 
  Did you really check your changes {or just believed some quite imperfect  "devtools" ?}

3) The change in trace.R breaks at least one test;
   so  elNamed(.) is needed there, too

-------
Still things look good in general .. and I'm continuing to test
Comment 4 Peter Haverty 2015-07-31 14:41:06 UTC
Created attachment 1865 [details]
updated patch

reverted trace.R and addedFunctions.R (el), checked possibleExtends
Comment 5 Martin Maechler 2015-07-31 14:49:38 UTC
(In reply to Martin Maechler from comment #3)
> ...............
> Still things look good in general .. and I'm continuing to test

4) After applying the patch, something like

 require("stats4")
 getClasses("package:stats4") 

  is empty.  {this was found in checking "Matrix" as part of  'make check-all'}
 
  Of course easy to fix.  {though it shows the danger of replacing 
     'objects(where, ...)'  by basically  'names(where)' : The latter assumes an environment, the former works quite a bit more generally.

... ok continuing to test ...
Comment 6 Martin Maechler 2015-07-31 16:24:29 UTC
(In reply to Peter Haverty from comment #4)
> Created attachment 1865 [details]
> updated patch
> 
> reverted trace.R and addedFunctions.R (el), checked possibleExtends

Hmm.. but there's no change to the diff of  .possibleExtends()...

I'm keeping and testing my version.. now, and don't want to re-patch anyway.
hence, no need  for you to modify the patch again...

5) This an order of magnitude more hidden / subtle:

--------------------------------------------------------------------
> library(lme4)
Loading required package: Matrix
 
Warning message:
In namespaceImportMethods(ns, loadNamespace(j <- imp[[1L]], c(lib.loc,  :
  No generic function found corresponding to requested imported methods for "coerce" from package "Matrix" (malformed exports?)
--------------------------------------------------------------------
where of course the  Warning  is  not good.
I've debugged and tracked and found that indeed, the *imports* of lme4 (= the parent.env() of the namespace of lme4)  where only containing 90 something objects instead of 160 something...   so I guess the changes to namespace.R may be the problem..
....
yes, 10 minutes later: If I drop all your changes to namespace.R  the above
warning is gone..

Can you reproduce the problem above (with your changes, i.e., including the namespace.R ones) ?
...
and I was hoping to enjoy a week of vacations starting this weekend ... aar..
Comment 7 Martin Maechler 2015-07-31 17:10:44 UTC
I have now committed very small unrelated changes in the same files, you've patched, so we can more easily keep track..

Just found 

6) Some of your changes to library()  have completely eliminated the conflict warnings ... not good at all!

Test case:

   require(dplyr)

should show a note about all the re-definitions ...
Similar for
   require(Rmpfr)

... or take your favorite package which *gives* those notices in an unpatched version of R.
Comment 8 Michael Lawrence 2015-07-31 18:27:16 UTC
I bet the loss of conflict reporting is due to a replacement of a call to objects(), since objects() works with named search path elements, but names(), etc, do not.
Comment 9 Peter Haverty 2015-08-01 14:35:35 UTC
Created attachment 1868 [details]
proper replacement for "objects" calls

In some cases, objects calls within "checkConflicts" were using an undocumented behavior. Normally, the argument "pos" takes an integer and then works on the environment in that position on the search path. Unexpectedly, one can also pass an integer to the first argument and objects/ls will behave as if this integer came in through the "pos" argument.

In the end, an "as.environment" was all that was needed. We weren't seeing a subtle dependence on sorted output or the difference between .Internals and .Primitives. This makes me more confident that replacing "objects" with "names" is OK. I have confirmed that we retain the proper conflicts messages from "require(dplyr)" and "library(lme4)".

I have attached a patch that contains only the "objects" to "names" edits for "attach.R" and "library.R".
Comment 10 Martin Maechler 2015-08-01 20:44:14 UTC
(In reply to Peter Haverty from comment #9)
> Created attachment 1868 [details]
> proper replacement for "objects" calls
> 
> In some cases, objects calls within "checkConflicts" were using an
> undocumented behavior. Normally, the argument "pos" takes an integer and
> then works on the environment in that position on the search path.
> Unexpectedly, one can also pass an integer to the first argument and
> objects/ls will behave as if this integer came in through the "pos" argument.
> 
> In the end, an "as.environment" was all that was needed. We weren't seeing a
> subtle dependence on sorted output or the difference between .Internals and
> .Primitives. This makes me more confident that replacing "objects" with
> "names" is OK. I have confirmed that we retain the proper conflicts messages
> from "require(dplyr)" and "library(lme4)".
> 
> I have attached a patch that contains only the "objects" to "names" edits
> for "attach.R" and "library.R".

Yes, that one looks good.

However, the changes in  base/R/namespace.R (still in your elNamed3.diff)
do lead to  '5)' above -- don't you get the

Warning message:
In namespaceImportMethods(ns, loadNamespace(j <- imp[[1L]], c(lib.loc,  :
  No generic function found corresponding to requested imported methods for "coerce" from package "Matrix" (malformed exports?)


??
Debugging that part is quite a bit of work, and I have not succeeded much further than noticing that indeed, if I keep namespace.R  unchanged, the problem does not appear.
Comment 11 Peter Haverty 2015-08-01 20:46:57 UTC
Created attachment 1869 [details]
simpler possibleExtends

This is a simpler, corrected version of my possibleExtends patch.
Comment 12 Peter Haverty 2015-08-04 15:34:53 UTC
Created attachment 1871 [details]
base:::.mergeImportMethods patch

simple speedups for a small function
Comment 13 Peter Haverty 2015-08-05 00:46:44 UTC
Created attachment 1872 [details]
possible extends and .identC
Comment 14 Martin Maechler 2015-08-05 10:21:50 UTC
(In reply to Peter Haverty from comment #12)
> Created attachment 1871 [details]
> base:::.mergeImportMethods patch
> 
> simple speedups for a small function

Indeed. .. I saw another one.. and committed it to R-devel svn 68851
Comment 15 Peter Haverty 2015-08-06 00:55:39 UTC
Created attachment 1875 [details]
simple namespaces speedups

These changes are mostly s/get|get0|assign/[[/ changes. I also moved .hasS4metadata from methods to base, which I hope is OK. It is only ever called as methods:::.hasS4metadata, so perhaps it should live in base.

Also, the 25% of package load time spent in lazyLoadDBfetch seems to be triggered by get/exists calls, especially those ::: calls in namespace.R like methods:::getS4metadata. Therefore, I think it would be especially beneficial to remove as many get and ::: calls as possible to cut down on lazyLoadDBfetch calls.
Comment 16 Martin Maechler 2015-08-07 14:10:53 UTC
(In reply to Peter Haverty from comment #15)
> Created attachment 1875 [details]
> simple namespaces speedups

Thank you, Pete,

> 
> These changes are mostly s/get|get0|assign/[[/ changes. I also moved
> .hasS4metadata from methods to base, which I hope is OK. It is only ever
> called as methods:::.hasS4metadata, so perhaps it should live in base.

you should have asked first... as your patch shows it is used in methods/R/*.R even though only once.
Putting it in base adds another "public" object (base cannot hide objects) there, and we would try somewhat hard to avoid that. Also, if in base, everybody can use it and hence it be part of R's official API.

Are you really sure that the methods:::  part pays any noticable cost in e.g.
lazyLoadDBfetch ?

Otherwise, I'd strongly prefer a patch where that function remains in methods.
This may sound pedantic, but then this is about making good decisions..

> 
> Also, the 25% of package load time spent in lazyLoadDBfetch seems to be
> triggered by get/exists calls, especially those ::: calls in namespace.R
> like methods:::getS4metadata. Therefore, I think it would be especially
> beneficial to remove as many get and ::: calls as possible to cut down on
> lazyLoadDBfetch calls.
Comment 17 Peter Haverty 2015-08-09 21:39:13 UTC
Created attachment 1882 [details]
new and improved possibleExtends

Eliminates a "match" call that was one of the slower bits of the whole package loading process.  Also makes logic of the function more obvious/explicit.
Comment 18 Martin Maechler 2015-08-10 14:11:26 UTC
(In reply to Peter Haverty from comment #17)
> Created attachment 1882 [details]
> new and improved possibleExtends
> 
> Eliminates a "match" call that was one of the slower bits of the whole
> package loading process.  Also makes logic of the function more
> obvious/explicit.

Yes; this looks good.. and indeed is more efficient (ca 20%).
I've made a version with less  return( . ) 
and committed as svn rev 68978.
Comment 19 Peter Haverty 2015-08-11 19:15:18 UTC
Created attachment 1883 [details]
a few small changes
Comment 20 Martin Maechler 2015-08-12 09:55:40 UTC
(In reply to Peter Haverty from comment #19)
> Created attachment 1883 [details]
> a few small changes

Thank you.

I've committed these (w/ small modifications in is.R) and even more in is.R.
Comment 21 Peter Haverty 2015-08-15 18:09:16 UTC
Created attachment 1886 [details]
simplification for methods::.getGeneric

.getGeneric manually exchanges as.double for as.numeric. This is no longer necessary as these are set to be identical at the end of base/R/zzz.R
Comment 22 Martin Maechler 2015-08-15 20:05:46 UTC
(In reply to Peter Haverty from comment #21)
> Created attachment 1886 [details]
> simplification for methods::.getGeneric
> 
> .getGeneric manually exchanges as.double for as.numeric. This is no longer
> necessary as these are set to be identical at the end of base/R/zzz.R

but the other change replaces  get0(f, .GlobalEnv)   by  .GlobalEnv[[f]]
which is very different if remind yourself that the default of 'inherits = TRUE'  in get0(), e.g.,
     get0("lm", .GlobalEnv)  
typically returns stats::lm
Comment 23 Peter Haverty 2015-08-15 20:23:21 UTC
Created attachment 1887 [details]
replacement for simplification in .getGeneric

Now contains just the removal of the unnecessary s/as.double/as.numeric/
Comment 24 Peter Haverty 2015-08-16 15:51:14 UTC
Created attachment 1888 [details]
simplifications for 'getClassDef' and 'is'

There are two simplifications here involving class vectors with a length > 1 (S3 case). In one case it is significantly cheaper to just subset to the first one element rather than checking the length (even if the length is already 1L). In the other case, the rest of getClassDef is tolerant to a length > 1, so subsetting is unnecessary.  Also, I have a replacement for a call to 'identical' that is faster and similarly clear.

This patch contains a request to deprecate calling 'is' with one argument, which is simply a synonym for 'extends'. As these two functions ('is' and 'getClassDef') are called very frequently there is benefit in having them be very simple.

This patch also includes a bite-sized set of s/el|elNamed/[[/ suggestions.
Comment 25 Peter Haverty 2015-08-17 03:13:15 UTC
Created attachment 1889 [details]
using substr(x,1L,...) rather than grep("^...",x)

simplifications and speedups for base's namespace.R, library.R and attach.R using substr to check namespace "MetaName" prefixes rather than grep(l).  When grep can't use fixed=TRUE, this approach takes ~60% less time.
Comment 26 Martin Maechler 2015-08-18 13:50:00 UTC
(In reply to Peter Haverty from comment #24)
> Created attachment 1888 [details]
> simplifications for 'getClassDef' and 'is'
> 
> There are two simplifications here involving class vectors with a length > 1
> (S3 case). In one case it is significantly cheaper to just subset to the
> first one element rather than checking the length (even if the length is
> already 1L). In the other case, the rest of getClassDef is tolerant to a
> length > 1, so subsetting is unnecessary.  Also, I have a replacement for a
> call to 'identical' that is faster and similarly clear.
> 
> This patch contains a request to deprecate calling 'is' with one argument,
> which is simply a synonym for 'extends'. As these two functions ('is' and
> 'getClassDef') are called very frequently there is benefit in having them be
> very simple.
> 
> This patch also includes a bite-sized set of s/el|elNamed/[[/ suggestions.

Thank you, Pete.
That is interesting... but also a mix of speedup  with something completely different:

Deprecation of functionality of R should not be part of a patch on speedup.
Rather something that be made a good point about in a "RFC"-like way.
Please do propose the deprecation on R-devel so people who are interested can discuss about your argumentation.
Comment 27 Peter Haverty 2015-08-20 23:14:12 UTC
Created attachment 1896 [details]
much simpler and faster 'is'

This is a much simpler version of methods::is. Following a recommendation from Michael Lawrence, I have used 'inherits' rather than 'possibleExtends' in more cases to take advantage of the caching in 'inherits'. The time savings for the 'is' function is > 50%. With the already checked-in patches from this issue, this patch would brings us to a 25% package loading time reduction over R 3.1(, which was quite a bit faster than R 3.0). There will be some improvements for other methods functionality too.

I've checked this by running the tests for a number of S4-heavy Bioconductor packages.

Also, I will follow this patch with an Rdevel RFC to discuss deprecating some alternate usages of 'is' that aren't documented or don't seem to be used.
Comment 28 Peter Haverty 2015-08-21 21:19:33 UTC
Created attachment 1897 [details]
simplifications for getting classes and generics

Removes what looks like vestigial debug code in getClassDef and also a simplification for .getGeneric. Related to RFC sent to r-devel.
Comment 29 Peter Haverty 2015-12-16 02:44:02 UTC
Created attachment 1944 [details]
replacement for simple namespaces speedups

Rather than moving .hasS4Metadata from methods to base, I inlined a simplified version of this one liner to the two places it gets called in base/R/namespace.R
Comment 30 Michael Lawrence 2015-12-16 18:37:36 UTC
Pete gave me a whole plate of cookies left over from someone else's lunch meeting yesterday in exchange for giving these patches some attention. I ate half of one cookie, so I guess I'm now on the hook for these ;)
Comment 31 Peter Haverty 2015-12-16 19:26:14 UTC
Created attachment 1945 [details]
yet newer simple namespace speedups

This is my "simple namespaces speedups" patch modified to get .hasS4MetaData from methods via :::