Bug 16932 - rep() calls rep.int() when the argument "times" is not an integer
Summary: rep() calls rep.int() when the argument "times" is not an integer
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Misc (show other bugs)
Version: R 3.2.4 revised
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-06-02 14:35 UTC by jean.d.marchal
Modified: 2017-02-27 17:15 UTC (History)
4 users (show)

See Also:


Attachments
Change just for 'times' of length 1 (1.29 KB, patch)
2016-10-28 17:09 UTC, Suharto Anggono
Details | Diff
Change just for 'times' of length 1 (1.35 KB, patch)
2016-10-28 17:23 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, restrict change for scalar real 'times' > INT_MAX (2.33 KB, patch)
2016-11-11 18:45 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all (9.42 KB, patch)
2017-01-02 02:05 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (16.96 KB, patch)
2017-01-02 02:09 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (17.23 KB, patch)
2017-01-02 02:16 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (17.29 KB, patch)
2017-01-02 02:42 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (17.23 KB, patch)
2017-01-02 03:14 UTC, Suharto Anggono
Details | Diff
Against R devel r71920 (1.24 KB, patch)
2017-01-06 16:23 UTC, Suharto Anggono
Details | Diff
Against R devel r71966, rep2: j backwards (1.71 KB, patch)
2017-01-14 10:38 UTC, Suharto Anggono
Details | Diff
'rep2', old and changed (10.78 KB, text/x-csrc)
2017-01-16 17:04 UTC, Suharto Anggono
Details
'rep2' timing script (560 bytes, text/plain)
2017-01-16 17:26 UTC, Suharto Anggono
Details
Against R devel r72026, 'rep.int' always uses 'rep2' for 'times' as long as 'x', like old (353 bytes, patch)
2017-01-25 16:23 UTC, Suharto Anggono
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description jean.d.marchal 2016-06-02 14:35:27 UTC
rep(1, times = 2^32) throws an error
rep(1, times = 2^32 / 2-1) works
rep_len(1, 2^32) works

maybe rep() should can rep_len() instead of rep.int() when times is not an integer ?

Thanks for the hard work!

Jean
Comment 1 Michael Lawrence 2016-06-02 16:47:44 UTC
There is nothing wrong with rep() calling rep.int(). Pretty sure the ".int" refers to "internal" not "integer". I guess rep.int() could be made to support >=2^32 for the times argument.
Comment 2 jean.d.marchal 2016-06-02 17:14:50 UTC
That would be great then :).
Comment 3 Hannes Mühleisen 2016-07-02 15:34:25 UTC
working on a patch
Comment 4 jean.d.marchal 2016-07-04 09:10:41 UTC
Awesome! :)
Comment 5 hiebert 2016-08-22 22:25:57 UTC
I'd be keen to have rep.int support rep >= 2^32. There's quite a bit of software out there that allocates a vector before returning a modified result. For example:

do.something <- function(x) {
  result <- rep(NA, length(x))
  # Modify x in some way and insert values into result
  return(result)
}

Anything using this pattern will fail with a very non-obvious error message. If one is using rep directly, it's straightforward to debug. However, uses such as these are often buried in the depths of many libraries, which makes it a challenge for the user to figure out what's going on.

Anything using using this pattern will not be able to scale beyond 4GB of RAM, and it'd be very nice it that could be fixed. How is the patch coming, Hannes?
Comment 6 Suharto Anggono 2016-10-28 17:09:26 UTC
Created attachment 2175 [details]
Change just for 'times' of length 1
Comment 7 Suharto Anggono 2016-10-28 17:23:23 UTC
Created attachment 2176 [details]
Change just for 'times' of length 1
Comment 8 Martin Maechler 2016-11-01 08:30:37 UTC
Thank you!
- The change for rep.int() was good and simple

- The change for rep() needed more: Otherwise you still get warning about coercion to integer..

  I have committed a fix to R-devel yesterday and will port to R-patched within another day or so.
Comment 9 Suharto Anggono 2016-11-01 15:57:47 UTC
(In reply to Martin Maechler from comment #8)
>   I have committed a fix to R-devel yesterday and will port to R-patched
> within another day or so.

The NEWS entry is currently the following.
      \item \code{rep(x, times)} and \code{rep.int(x, times)} now both
      work also when \code{times} is larger than the maximal integer.
      (\PR{16932})

Details that might be added:
- It is just for 'times' of length 1.
- In 64-bit R, for 'times' of length 1 that is larger than the maximal integer, rep.int(x, times) previously works, except if 'x' is of length 1.


A side issue, documentation:
- For argument 'length.out' of 'rep', the documentation says: "Other inputs will be coerced to an integer vector and the first element taken." In fact, number larger than .Machine$integer.max is accepted.
- The documentation says that argument 'times' of 'rep' or 'rep.int' is a integer vector. In fact, if 'x' is not of length 1 for 'rep.int', the storage mode of 'times' of length 1 can be "double". After the change, it happens for 'rep.int' and 'rep' generally.
Comment 10 Suharto Anggono 2016-11-11 18:45:39 UTC
Created attachment 2180 [details]
Against R devel r71645, restrict change for scalar real 'times' > INT_MAX

I was surprised by r71638, "replace rep(*, list(.)) error by deprecation warning for now". So, if TYPEOF(x) is VECSXP (a list), asInteger(x) always gives NA_INTEGER, may be different from INTEGER(coerceVector(x, INTSXP))[0].

Similar change in behavior occurs in rep.int(x, times) for 'x' of length 1.

Another, subtle, change is that times = -0.5 becomes an error in 64-bit R.

To ensure that previous call to 'rep' or 'rep.int' without error stays so, change can be restricted to 'times' that is a length-1 vector with storage mode "double" and value larger than .Machine$integer.max.
Comment 11 Martin Maechler 2016-11-12 20:27:58 UTC
(In reply to Suharto Anggono from comment #10)
> Created attachment 2180 [details]
> Against R devel r71645, restrict change for scalar real 'times' > INT_MAX
>
I see...   This is actually what an R core colleague would have wanted.
Personally, I've remained pretty convinced that keeping to allow list() etc, has never been according to the documentation, and hence should be deprecated (to become an error eventually).

> 
> I was surprised by r71638, "replace rep(*, list(.)) error by deprecation
> warning for now". So, if TYPEOF(x) is VECSXP (a list), asInteger(x) always
> gives NA_INTEGER, may be different from INTEGER(coerceVector(x, INTSXP))[0].
> 
> Similar change in behavior occurs in rep.int(x, times) for 'x' of length 1.

Indeed, yes,  in  R 3.3.2 (or earlier),

  > rep.int(4, list(7))
  [1] 4 4 4 4 4 4 4
 
> Another, subtle, change is that times = -0.5 becomes an error in 64-bit R.

Thanks... but I think that's no problem


> To ensure that previous call to 'rep' or 'rep.int' without error stays so,
> change can be restricted to 'times' that is a length-1 vector with storage
> mode "double" and value larger than .Machine$integer.max.

Yes, thank you.   I'll keep thinking about your proposal.  For efficiency in some cases, it may be (very very slightly) advantageous to keep allowing  list()s  as `times` argument..... and if we use (something like) your proposed patch, we would just tell the user in the help file that times indeed is       "internally coerced to integer similar to as.integer(.) (but without method dispatch)"   

where the ending 2/3 of the sentence is alluding to the use of  coerceVector(*, INTSXP).
Comment 12 Suharto Anggono 2017-01-02 02:05:32 UTC
Created attachment 2207 [details]
Against R devel r71645, all

This uses double in 64-bit R.
Comment 13 Suharto Anggono 2017-01-02 02:09:48 UTC
Created attachment 2208 [details]
Against R devel r71645, all with check + closer 32-bit and 64-bit R
Comment 14 Suharto Anggono 2017-01-02 02:16:03 UTC
Created attachment 2209 [details]
Against R devel r71645, all with check + closer 32-bit and 64-bit R
Comment 15 Suharto Anggono 2017-01-02 02:42:40 UTC
Created attachment 2210 [details]
Against R devel r71645, all with check + closer 32-bit and 64-bit R
Comment 16 Suharto Anggono 2017-01-02 03:14:42 UTC
Created attachment 2211 [details]
Against R devel r71645, all with check + closer 32-bit and 64-bit R

Back to 2209. I am sorry for the noise. In 'do_rep', checking 'times' is done when lx > 0.
Comment 17 Suharto Anggono 2017-01-02 05:32:08 UTC
In function 'do_rep' in seq.c, 'nprotect' is initialized to 3. Is it correct? I see 2 PROTECT calls before 'nprotect' is first incremented.
Comment 18 Martin Maechler 2017-01-04 11:52:47 UTC
(In reply to Suharto Anggono from comment #17)
> In function 'do_rep' in seq.c, 'nprotect' is initialized to 3. Is it
> correct? I see 2 PROTECT calls before 'nprotect' is first incremented.

You are right that it *looks* wrong. Setting to '2' however fails, too, because one 'nprotect++' has been missing later in the code.  In this sense, the '3' is correct.  

I'm currently testing (a version of) your patch ... 
and I have already changed that, now initializing nprotect = 2 ... and then using  nprotect++ every time PROTECT() is called.
Comment 19 Suharto Anggono 2017-01-06 16:23:59 UTC
Created attachment 2212 [details]
Against R devel r71920

I made this because I thought that INTEGER(t) might be unsafe or undesired for zero-length integer vector 't' (TYPEOF(t) == INTSXP and XLENGTH(t) == 0). Of course, instead of changing, adding 'if' is also possible.
Comment 20 Martin Maechler 2017-01-06 21:30:45 UTC
(In reply to Suharto Anggono from comment #19)
> Created attachment 2212 [details]
> Against R devel r71920
> 
> I made this because I thought that INTEGER(t) might be unsafe or undesired
> for zero-length integer vector 't' (TYPEOF(t) == INTSXP and XLENGTH(t) ==
> 0). Of course, instead of changing, adding 'if' is also possible.

Your proposed change  indeed a bit more elegant code, ... though really not changing any behavior I think.

I will eventually try to apply it... working on seq.c  anyway because of the 
seq.int()  error messages etc.
Comment 21 Suharto Anggono 2017-01-14 10:38:10 UTC
Created attachment 2213 [details]
Against R devel r71966, rep2: j backwards

From my little testing, 'rep.int' with a vector with storage mode "double" as 'times' argument seems to become slower. Saving
(R_xlen_t) REAL(t)[i]
in a variable seems to help. Without introducing a new variable, counting backwards can be done.
Comment 22 Martin Maechler 2017-01-15 17:40:17 UTC
(In reply to Suharto Anggono from comment #21)
> Created attachment 2213 [details]
> Against R devel r71966, rep2: j backwards
> 
> From my little testing, 'rep.int' with a vector with storage mode "double"
> as 'times' argument seems to become slower. Saving
> (R_xlen_t) REAL(t)[i]
> in a variable seems to help. Without introducing a new variable, counting
> backwards can be done.

Interesting. I would have expected compiler optimization to take care of this (so that we would not see a difference).  Are you using at least -O2 ?

2nd question: what speed measurements did you do exactly?
Comment 23 Suharto Anggono 2017-01-16 17:02:07 UTC
(In reply to Martin Maechler from comment #22)
> (In reply to Suharto Anggono from comment #21)
> > Created attachment 2213 [details]
> > Against R devel r71966, rep2: j backwards
> > 
> > From my little testing, 'rep.int' with a vector with storage mode "double"
> > as 'times' argument seems to become slower. Saving
> > (R_xlen_t) REAL(t)[i]
> > in a variable seems to help. Without introducing a new variable, counting
> > backwards can be done.
> 
> Interesting. I would have expected compiler optimization to take care of
> this (so that we would not see a difference).  Are you using at least -O2 ?
> 
> 2nd question: what speed measurements did you do exactly?

I used -O2.

I looked at elapsed time.
Comment 24 Suharto Anggono 2017-01-16 17:04:25 UTC
Created attachment 2214 [details]
'rep2', old and changed
Comment 25 Suharto Anggono 2017-01-16 17:26:33 UTC
Created attachment 2215 [details]
'rep2' timing script

To produce rep2.dll, I used
gcc -std=c99 -pedantic-errors -Wall -O2 -shared -s -o rep2.dll rep2.c -I"C:\Documents and Settings\personal\My Documents\R\R-3.3.2\include" -L"C:\Documents and Settings\personal\My Documents\R\R-3.3.2\bin\i386" -lR

Output:

, , x = double, n = double

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.30     0.38       0.39       0.28
  [2,] 0.28     0.39       0.39       0.28

, , x = logical, n = double

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.16     0.12       0.16       0.15
  [2,] 0.13     0.17       0.16       0.14

, , x = character, n = double

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.44     0.68       0.65       0.44
  [2,] 0.45     0.76       0.70       0.43

, , x = raw, n = double

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.06     0.22       0.05       0.10
  [2,] 0.05     0.22       0.07       0.05

, , x = double, n = integer

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.28     0.29       0.30       0.28
  [2,] 0.30     0.29       0.28       0.28

, , x = logical, n = integer

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.15     0.13       0.15       0.15
  [2,] 0.13     0.16       0.16       0.17

, , x = character, n = integer

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.44     0.45       0.58       0.45
  [2,] 0.44     0.50       0.46       0.47

, , x = raw, n = integer

      f
       rep2 rep2_new rep2_new2v rep2_new2c
  [1,] 0.06     0.05       0.05       0.08
  [2,] 0.05     0.04       0.05       0.05

R version 3.3.2 (2016-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows XP (build 2600) Service Pack 2

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     


Changed 'rep2' ('rep2_new') seems not slower in replicating logical vector.
Just saving
(R_xlen_t) REAL(t)[i]
in a variable ('rep2_new2v') seems not so helpful in replicating a vector with mode other than "raw".

Initially, I tried replicating a raw vector only.
Comment 26 Martin Maechler 2017-01-17 09:09:15 UTC
All this is taking too much human time for not very convincing reasons.
To keep our (human) time small,
I'm applying the simple  "count backward" patch now ... 
this time, as an exception and because the patch is small and hard to imagine a situation where it could decrease performance.

HOWEVER, such topics are *NOT* something I would want to spend time on in the future
- quite platform, compiler, .... dependent
- as we see here, for the important cases (int, double) the gain is negligible
- there are much more interesting and relevant problems to tackle.
- we'd want to see complete R code (and C code in this case, you did provide)
  for the performance measurements from the beginning
  (Here, you never showed the 'x' and 'n' you used, nor gave the R code to save our time 
   when running on different platforms)

Really I won't do this anymore in such cases.
Comment 27 Suharto Anggono 2017-01-17 15:38:05 UTC
(In reply to Martin Maechler from comment #26)
> All this is taking too much human time for not very convincing reasons.
> To keep our (human) time small,
> I'm applying the simple  "count backward" patch now ... 
> this time, as an exception and because the patch is small and hard to
> imagine a situation where it could decrease performance.
> 
> HOWEVER, such topics are *NOT* something I would want to spend time on in
> the future
> - quite platform, compiler, .... dependent
> - as we see here, for the important cases (int, double) the gain is
> negligible
> - there are much more interesting and relevant problems to tackle.
> - we'd want to see complete R code (and C code in this case, you did provide)
>   for the performance measurements from the beginning
>   (Here, you never showed the 'x' and 'n' you used, nor gave the R code to
> save our time 
>    when running on different platforms)
> 
> Really I won't do this anymore in such cases.

I did provide R code: attachment 2215 [details], "'rep2' timing script".
This is in the file.
n <- list(2e7, as.integer(2e7))
x <- list(1.0, TRUE, "1", as.raw(1))

In RStudio in Data Scientist Workbench (Debian, R CMD SHLIB to produce .so), I also have tried
.Call("rep2_new", 1, 2^31-1, PACKAGE="rep2") .
From my memory, the elapsed time was longer than the original 'rep.int' by about 1 second, about 5%.

I raised the slowdown issue because I thought (from the documentation) that 'rep.int' was supposed to be fast, so noticeable speed reduction was not desired, even if not large. I might be wrong.
Comment 28 Suharto Anggono 2017-01-25 16:23:35 UTC
Created attachment 2221 [details]
Against R devel r72026, 'rep.int' always uses 'rep2' for 'times' as long as 'x', like old

This part is forgotten from attachment 2211 [details].
By this, rep.int(<scalar>, <scalar>) goes to 'rep2', as in R 3.3.2.
Using 'rep2' is faster than 'rep3' in some cases.
Comment 29 Suharto Anggono 2017-01-26 15:24:18 UTC
(In reply to Suharto Anggono from comment #28)
> Created attachment 2221 [details]
> Against R devel r72026, 'rep.int' always uses 'rep2' for 'times' as long as
> 'x', like old

I forgot the real reason: with this,
rep.int(4, list(7))
(Comment 11) works again.
Comment 30 Suharto Anggono 2017-02-24 16:22:57 UTC
In R 3.3.3 beta r72254, in "CHANGES IN R 3.3.2 patched" in NEWS, an item is the following.
      \item \code{rep(x, times)} and \code{rep.int(x, times)} now both
      work also when \code{times} is larger than the maximal integer,
      including when it is of length greater than one. (\PR{16932})

The "including when it is of length greater than one" part doesn't hold. It is just for 'times' of length 1.

Also, as discovered earlier,
rep(4, list(7))
and
rep.int(4, list(7))
give error.

The 'rep.int' case also gives error in R devel r72254. Please look at attachment 2221 [details].
Comment 31 Martin Maechler 2017-02-27 08:48:33 UTC
I've changed the NEWS entry for R-patched (and moved the more general statement to a new entry for R-devel).

The changes were quite substantial so I did not port them to R-patched at the time..  We could have ported them later, but not now, immediately before code freeze.
 
The  rep(5, list(6))   was never "meant to" work and had worked incidentally only.
Comment 32 Martin Maechler 2017-02-27 17:15:36 UTC
(In reply to Martin Maechler from comment #31
> The  rep(5, list(6))   was never "meant to" work and had worked incidentally
> only.

But then I had promised to think about the issue, and notably with the nice
simple change you've proposed, I have reverted to previous behavior,
following our own emphasis:   In R, a vector is much more general class than just atomic vectors.