Bug 16932 - rep() calls rep.int() when the argument "times" is not an integer
rep() calls rep.int() when the argument "times" is not an integer
 Status: CLOSED FIXED None R Unclassified Misc (show other bugs) R 3.2.4 revised x86_64/x64/amd64 (64-bit) Windows 64-bit P5 enhancement R-core

 Reported: 2016-06-02 14:35 UTC by jean.d.marchal 2017-02-27 17:15 UTC (History) 4 users (show) hannes hiebert maechler michafla

Attachments
Change just for 'times' of length 1 (1.29 KB, patch)
2016-10-28 17:09 UTC, Suharto Anggono
Details | Diff
Change just for 'times' of length 1 (1.35 KB, patch)
2016-10-28 17:23 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, restrict change for scalar real 'times' > INT_MAX (2.33 KB, patch)
2016-11-11 18:45 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all (9.42 KB, patch)
2017-01-02 02:05 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (16.96 KB, patch)
2017-01-02 02:09 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (17.23 KB, patch)
2017-01-02 02:16 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (17.29 KB, patch)
2017-01-02 02:42 UTC, Suharto Anggono
Details | Diff
Against R devel r71645, all with check + closer 32-bit and 64-bit R (17.23 KB, patch)
2017-01-02 03:14 UTC, Suharto Anggono
Details | Diff
Against R devel r71920 (1.24 KB, patch)
2017-01-06 16:23 UTC, Suharto Anggono
Details | Diff
Against R devel r71966, rep2: j backwards (1.71 KB, patch)
2017-01-14 10:38 UTC, Suharto Anggono
Details | Diff
'rep2', old and changed (10.78 KB, text/x-csrc)
2017-01-16 17:04 UTC, Suharto Anggono
Details
'rep2' timing script (560 bytes, text/plain)
2017-01-16 17:26 UTC, Suharto Anggono
Details
Against R devel r72026, 'rep.int' always uses 'rep2' for 'times' as long as 'x', like old (353 bytes, patch)
2017-01-25 16:23 UTC, Suharto Anggono
Details | Diff

 Note You need to log in before you can comment on or make changes to this bug.
 jean.d.marchal 2016-06-02 14:35:27 UTC rep(1, times = 2^32) throws an error rep(1, times = 2^32 / 2-1) works rep_len(1, 2^32) works maybe rep() should can rep_len() instead of rep.int() when times is not an integer ? Thanks for the hard work! Jean Michael Lawrence 2016-06-02 16:47:44 UTC There is nothing wrong with rep() calling rep.int(). Pretty sure the ".int" refers to "internal" not "integer". I guess rep.int() could be made to support >=2^32 for the times argument. jean.d.marchal 2016-06-02 17:14:50 UTC That would be great then :). Hannes Mühleisen 2016-07-02 15:34:25 UTC working on a patch jean.d.marchal 2016-07-04 09:10:41 UTC Awesome! :) hiebert 2016-08-22 22:25:57 UTC I'd be keen to have rep.int support rep >= 2^32. There's quite a bit of software out there that allocates a vector before returning a modified result. For example: do.something <- function(x) { result <- rep(NA, length(x)) # Modify x in some way and insert values into result return(result) } Anything using this pattern will fail with a very non-obvious error message. If one is using rep directly, it's straightforward to debug. However, uses such as these are often buried in the depths of many libraries, which makes it a challenge for the user to figure out what's going on. Anything using using this pattern will not be able to scale beyond 4GB of RAM, and it'd be very nice it that could be fixed. How is the patch coming, Hannes? Suharto Anggono 2016-10-28 17:09:26 UTC Created attachment 2175 [details] Change just for 'times' of length 1 Suharto Anggono 2016-10-28 17:23:23 UTC Created attachment 2176 [details] Change just for 'times' of length 1 Martin Maechler 2016-11-01 08:30:37 UTC Thank you! - The change for rep.int() was good and simple - The change for rep() needed more: Otherwise you still get warning about coercion to integer.. I have committed a fix to R-devel yesterday and will port to R-patched within another day or so. Suharto Anggono 2016-11-01 15:57:47 UTC (In reply to Martin Maechler from comment #8) > I have committed a fix to R-devel yesterday and will port to R-patched > within another day or so. The NEWS entry is currently the following. \item \code{rep(x, times)} and \code{rep.int(x, times)} now both work also when \code{times} is larger than the maximal integer. (\PR{16932}) Details that might be added: - It is just for 'times' of length 1. - In 64-bit R, for 'times' of length 1 that is larger than the maximal integer, rep.int(x, times) previously works, except if 'x' is of length 1. A side issue, documentation: - For argument 'length.out' of 'rep', the documentation says: "Other inputs will be coerced to an integer vector and the first element taken." In fact, number larger than .Machine$integer.max is accepted. - The documentation says that argument 'times' of 'rep' or 'rep.int' is a integer vector. In fact, if 'x' is not of length 1 for 'rep.int', the storage mode of 'times' of length 1 can be "double". After the change, it happens for 'rep.int' and 'rep' generally. Suharto Anggono 2016-11-11 18:45:39 UTC Created attachment 2180 [details] Against R devel r71645, restrict change for scalar real 'times' > INT_MAX I was surprised by r71638, "replace rep(*, list(.)) error by deprecation warning for now". So, if TYPEOF(x) is VECSXP (a list), asInteger(x) always gives NA_INTEGER, may be different from INTEGER(coerceVector(x, INTSXP))[0]. Similar change in behavior occurs in rep.int(x, times) for 'x' of length 1. Another, subtle, change is that times = -0.5 becomes an error in 64-bit R. To ensure that previous call to 'rep' or 'rep.int' without error stays so, change can be restricted to 'times' that is a length-1 vector with storage mode "double" and value larger than .Machine$integer.max. Martin Maechler 2016-11-12 20:27:58 UTC (In reply to Suharto Anggono from comment #10) > Created attachment 2180 [details] > Against R devel r71645, restrict change for scalar real 'times' > INT_MAX > I see... This is actually what an R core colleague would have wanted. Personally, I've remained pretty convinced that keeping to allow list() etc, has never been according to the documentation, and hence should be deprecated (to become an error eventually). > > I was surprised by r71638, "replace rep(*, list(.)) error by deprecation > warning for now". So, if TYPEOF(x) is VECSXP (a list), asInteger(x) always > gives NA_INTEGER, may be different from INTEGER(coerceVector(x, INTSXP))[0]. > > Similar change in behavior occurs in rep.int(x, times) for 'x' of length 1. Indeed, yes, in R 3.3.2 (or earlier), > rep.int(4, list(7)) [1] 4 4 4 4 4 4 4 > Another, subtle, change is that times = -0.5 becomes an error in 64-bit R. Thanks... but I think that's no problem > To ensure that previous call to 'rep' or 'rep.int' without error stays so, > change can be restricted to 'times' that is a length-1 vector with storage > mode "double" and value larger than .Machine\$integer.max. Yes, thank you. I'll keep thinking about your proposal. For efficiency in some cases, it may be (very very slightly) advantageous to keep allowing list()s as times argument..... and if we use (something like) your proposed patch, we would just tell the user in the help file that times indeed is "internally coerced to integer similar to as.integer(.) (but without method dispatch)" where the ending 2/3 of the sentence is alluding to the use of coerceVector(*, INTSXP). Suharto Anggono 2017-01-02 02:05:32 UTC Created attachment 2207 [details] Against R devel r71645, all This uses double in 64-bit R. Suharto Anggono 2017-01-02 02:09:48 UTC Created attachment 2208 [details] Against R devel r71645, all with check + closer 32-bit and 64-bit R Suharto Anggono 2017-01-02 02:16:03 UTC Created attachment 2209 [details] Against R devel r71645, all with check + closer 32-bit and 64-bit R Suharto Anggono 2017-01-02 02:42:40 UTC Created attachment 2210 [details] Against R devel r71645, all with check + closer 32-bit and 64-bit R Suharto Anggono 2017-01-02 03:14:42 UTC Created attachment 2211 [details] Against R devel r71645, all with check + closer 32-bit and 64-bit R Back to 2209. I am sorry for the noise. In 'do_rep', checking 'times' is done when lx > 0. Suharto Anggono 2017-01-02 05:32:08 UTC In function 'do_rep' in seq.c, 'nprotect' is initialized to 3. Is it correct? I see 2 PROTECT calls before 'nprotect' is first incremented. Martin Maechler 2017-01-04 11:52:47 UTC (In reply to Suharto Anggono from comment #17) > In function 'do_rep' in seq.c, 'nprotect' is initialized to 3. Is it > correct? I see 2 PROTECT calls before 'nprotect' is first incremented. You are right that it *looks* wrong. Setting to '2' however fails, too, because one 'nprotect++' has been missing later in the code. In this sense, the '3' is correct. I'm currently testing (a version of) your patch ... and I have already changed that, now initializing nprotect = 2 ... and then using nprotect++ every time PROTECT() is called. Suharto Anggono 2017-01-06 16:23:59 UTC Created attachment 2212 [details] Against R devel r71920 I made this because I thought that INTEGER(t) might be unsafe or undesired for zero-length integer vector 't' (TYPEOF(t) == INTSXP and XLENGTH(t) == 0). Of course, instead of changing, adding 'if' is also possible. Martin Maechler 2017-01-06 21:30:45 UTC (In reply to Suharto Anggono from comment #19) > Created attachment 2212 [details] > Against R devel r71920 > > I made this because I thought that INTEGER(t) might be unsafe or undesired > for zero-length integer vector 't' (TYPEOF(t) == INTSXP and XLENGTH(t) == > 0). Of course, instead of changing, adding 'if' is also possible. Your proposed change indeed a bit more elegant code, ... though really not changing any behavior I think. I will eventually try to apply it... working on seq.c anyway because of the seq.int() error messages etc. Suharto Anggono 2017-01-14 10:38:10 UTC Created attachment 2213 [details] Against R devel r71966, rep2: j backwards From my little testing, 'rep.int' with a vector with storage mode "double" as 'times' argument seems to become slower. Saving (R_xlen_t) REAL(t)[i] in a variable seems to help. Without introducing a new variable, counting backwards can be done. Martin Maechler 2017-01-15 17:40:17 UTC (In reply to Suharto Anggono from comment #21) > Created attachment 2213 [details] > Against R devel r71966, rep2: j backwards > > From my little testing, 'rep.int' with a vector with storage mode "double" > as 'times' argument seems to become slower. Saving > (R_xlen_t) REAL(t)[i] > in a variable seems to help. Without introducing a new variable, counting > backwards can be done. Interesting. I would have expected compiler optimization to take care of this (so that we would not see a difference). Are you using at least -O2 ? 2nd question: what speed measurements did you do exactly? Suharto Anggono 2017-01-16 17:02:07 UTC (In reply to Martin Maechler from comment #22) > (In reply to Suharto Anggono from comment #21) > > Created attachment 2213 [details] > > Against R devel r71966, rep2: j backwards > > > > From my little testing, 'rep.int' with a vector with storage mode "double" > > as 'times' argument seems to become slower. Saving > > (R_xlen_t) REAL(t)[i] > > in a variable seems to help. Without introducing a new variable, counting > > backwards can be done. > > Interesting. I would have expected compiler optimization to take care of > this (so that we would not see a difference). Are you using at least -O2 ? > > 2nd question: what speed measurements did you do exactly? I used -O2. I looked at elapsed time. Suharto Anggono 2017-01-16 17:04:25 UTC Created attachment 2214 [details] 'rep2', old and changed Suharto Anggono 2017-01-16 17:26:33 UTC Created attachment 2215 [details] 'rep2' timing script To produce rep2.dll, I used gcc -std=c99 -pedantic-errors -Wall -O2 -shared -s -o rep2.dll rep2.c -I"C:\Documents and Settings\personal\My Documents\R\R-3.3.2\include" -L"C:\Documents and Settings\personal\My Documents\R\R-3.3.2\bin\i386" -lR Output: , , x = double, n = double f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.30 0.38 0.39 0.28 [2,] 0.28 0.39 0.39 0.28 , , x = logical, n = double f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.16 0.12 0.16 0.15 [2,] 0.13 0.17 0.16 0.14 , , x = character, n = double f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.44 0.68 0.65 0.44 [2,] 0.45 0.76 0.70 0.43 , , x = raw, n = double f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.06 0.22 0.05 0.10 [2,] 0.05 0.22 0.07 0.05 , , x = double, n = integer f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.28 0.29 0.30 0.28 [2,] 0.30 0.29 0.28 0.28 , , x = logical, n = integer f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.15 0.13 0.15 0.15 [2,] 0.13 0.16 0.16 0.17 , , x = character, n = integer f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.44 0.45 0.58 0.45 [2,] 0.44 0.50 0.46 0.47 , , x = raw, n = integer f rep2 rep2_new rep2_new2v rep2_new2c [1,] 0.06 0.05 0.05 0.08 [2,] 0.05 0.04 0.05 0.05 R version 3.3.2 (2016-10-31) Platform: i386-w64-mingw32/i386 (32-bit) Running under: Windows XP (build 2600) Service Pack 2 locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base Changed 'rep2' ('rep2_new') seems not slower in replicating logical vector. Just saving (R_xlen_t) REAL(t)[i] in a variable ('rep2_new2v') seems not so helpful in replicating a vector with mode other than "raw". Initially, I tried replicating a raw vector only. Martin Maechler 2017-01-17 09:09:15 UTC All this is taking too much human time for not very convincing reasons. To keep our (human) time small, I'm applying the simple "count backward" patch now ... this time, as an exception and because the patch is small and hard to imagine a situation where it could decrease performance. HOWEVER, such topics are *NOT* something I would want to spend time on in the future - quite platform, compiler, .... dependent - as we see here, for the important cases (int, double) the gain is negligible - there are much more interesting and relevant problems to tackle. - we'd want to see complete R code (and C code in this case, you did provide) for the performance measurements from the beginning (Here, you never showed the 'x' and 'n' you used, nor gave the R code to save our time when running on different platforms) Really I won't do this anymore in such cases. Suharto Anggono 2017-01-17 15:38:05 UTC (In reply to Martin Maechler from comment #26) > All this is taking too much human time for not very convincing reasons. > To keep our (human) time small, > I'm applying the simple "count backward" patch now ... > this time, as an exception and because the patch is small and hard to > imagine a situation where it could decrease performance. > > HOWEVER, such topics are *NOT* something I would want to spend time on in > the future > - quite platform, compiler, .... dependent > - as we see here, for the important cases (int, double) the gain is > negligible > - there are much more interesting and relevant problems to tackle. > - we'd want to see complete R code (and C code in this case, you did provide) > for the performance measurements from the beginning > (Here, you never showed the 'x' and 'n' you used, nor gave the R code to > save our time > when running on different platforms) > > Really I won't do this anymore in such cases. I did provide R code: attachment 2215 [details], "'rep2' timing script". This is in the file. n <- list(2e7, as.integer(2e7)) x <- list(1.0, TRUE, "1", as.raw(1)) In RStudio in Data Scientist Workbench (Debian, R CMD SHLIB to produce .so), I also have tried .Call("rep2_new", 1, 2^31-1, PACKAGE="rep2") . From my memory, the elapsed time was longer than the original 'rep.int' by about 1 second, about 5%. I raised the slowdown issue because I thought (from the documentation) that 'rep.int' was supposed to be fast, so noticeable speed reduction was not desired, even if not large. I might be wrong. Suharto Anggono 2017-01-25 16:23:35 UTC Created attachment 2221 [details] Against R devel r72026, 'rep.int' always uses 'rep2' for 'times' as long as 'x', like old This part is forgotten from attachment 2211 [details]. By this, rep.int(, ) goes to 'rep2', as in R 3.3.2. Using 'rep2' is faster than 'rep3' in some cases. Suharto Anggono 2017-01-26 15:24:18 UTC (In reply to Suharto Anggono from comment #28) > Created attachment 2221 [details] > Against R devel r72026, 'rep.int' always uses 'rep2' for 'times' as long as > 'x', like old I forgot the real reason: with this, rep.int(4, list(7)) (Comment 11) works again. Suharto Anggono 2017-02-24 16:22:57 UTC In R 3.3.3 beta r72254, in "CHANGES IN R 3.3.2 patched" in NEWS, an item is the following. \item \code{rep(x, times)} and \code{rep.int(x, times)} now both work also when \code{times} is larger than the maximal integer, including when it is of length greater than one. (\PR{16932}) The "including when it is of length greater than one" part doesn't hold. It is just for 'times' of length 1. Also, as discovered earlier, rep(4, list(7)) and rep.int(4, list(7)) give error. The 'rep.int' case also gives error in R devel r72254. Please look at attachment 2221 [details]. Martin Maechler 2017-02-27 08:48:33 UTC I've changed the NEWS entry for R-patched (and moved the more general statement to a new entry for R-devel). The changes were quite substantial so I did not port them to R-patched at the time.. We could have ported them later, but not now, immediately before code freeze. The rep(5, list(6)) was never "meant to" work and had worked incidentally only. Martin Maechler 2017-02-27 17:15:36 UTC (In reply to Martin Maechler from comment #31 > The rep(5, list(6)) was never "meant to" work and had worked incidentally > only. But then I had promised to think about the issue, and notably with the nice simple change you've proposed, I have reverted to previous behavior, following our own emphasis: In R, a vector is much more general class than just atomic vectors.