Bug 17372 - Return type of sum() depends on integer overflow
Summary: Return type of sum() depends on integer overflow
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Language (show other bugs)
Version: R-devel (trunk)
Hardware: All All
: P5 minor
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2018-01-09 15:51 UTC by Alex Bertram
Modified: 2018-01-12 15:14 UTC (History)
3 users (show)

See Also:


Attachments
Patch to ensure argument order does not matter (1.60 KB, patch)
2018-01-09 18:34 UTC, Alex Bertram
Details | Diff
Test cases for all combinations of arguments (207.50 KB, text/x-r-source)
2018-01-09 18:37 UTC, Alex Bertram
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alex Bertram 2018-01-09 15:51:17 UTC
The sum() has unexpected behavior when mixing integer vectors, complex vectors, and character vectors.

In most cases, the return type of sum() is determined by the types of its arguments, regardless of the argument order. For example:

sum(3.14, c(1073741824L, 1073741824L)) # NA_real_
sum(c(1073741824L, 1073741824L), 3.14) # NA_real_

However, this is not the case if one of the arguments is a complex vector:

sum(1+2i, c(1073741824L, 1073741824L)) # NA_complex_
sum(c(1073741824L, 1073741824L), 1+2i) # NA_real_

Error handling is also be influenced by the order of arguments.

sum("foo", c(1073741824L, 1073741824L)) # invalid type error
sum(c(1073741824L, 1073741824L), "foo") # NA_real_

This perhaps a bit surprising to users, and it also complicates optimization efforts as it requires computing the sums of all integer vectors before 
the result type is known, or even whether the program continues.

The current implementation includes a comment that suggests that correcting the edge cases would be consistent with the original intent:

/* we need to find out if _all_ the arguments are integer or logical
       in advance, as we might overflow before we find out.  NULL is
       documented to be the same as integer(0).
    */

Will attach a patch shortly.
Comment 1 Alex Bertram 2018-01-09 18:34:30 UTC
Created attachment 2311 [details]
Patch to ensure argument order does not matter
Comment 2 Alex Bertram 2018-01-09 18:37:10 UTC
Created attachment 2312 [details]
Test cases for all combinations of arguments

These test cases should cover all combinations of argument types and presence of NA/NaN/etc, and na.rm = {TRUE, FALSE}
Comment 3 Martin Maechler 2018-01-10 10:46:37 UTC
Thank you, Alex.
I agree the current behavior is undesired.

Your patch looks fine,  but is not quite: In the case of an invalid type, the error message gives the wrong type: 
inserting a
    a = CAR(a);
before the     'goto invalid_type'   does fix that problem.

No need for a changed patch.

Relatedly but different: We could think of improving

str(sum(1+2i, iL))
 cplx NA
Warning message:
In sum(1 + (0+2i), iL) : integer overflow - use sum(as.numeric(.))
> 

as the warning's  as.numeric(.) recommendation is not a particularly good one in this case.  As long as I don't see *elegant* code change for that, I'll let it be....

and would commit a slightly amended version of your patch.
Comment 4 Luke Tierney 2018-01-10 12:10:05 UTC
If you work on this keep in mind we also have this issue reported by Herve Pages:

> x <- numeric(2^31) ## a long vector
> sum(x == 0)
[1] NA
Warning message:
In sum(x == 0) : integer overflow - use sum(as.numeric(.))

We'll probably have to allow sum to return a double when it needs to overflow.
I'm not suggesting this be addressed now, just keep in mind as you change things.
Comment 5 Alex Bertram 2018-01-10 12:30:09 UTC
Thanks Martin for catching the incorrect error message!

Regarding the warning message, I think sum(numeric(.)) is still helpful even
if the return type is complex: sum(2+3i, as.numeric(iL)) will fix the problem
and sum(2+3i, as.complex(iL)) seems exaggerated.
Comment 6 Alex Bertram 2018-01-10 12:36:01 UTC
Luke, regarding Herve's case, if there is to be a significant change to sum's behavior, I would lobby for _always_ promoting the result of logical vectors to double vectors, as I don't personally think that result types should be dependent on the computation: I feel that it makes programs more difficult to reason about for both humans and compilers.

Out of curiosity, I was wondering about the original rationale for integer types in R. Are they motivated by precision, by performance, or by memory concerns?
Comment 7 Luke Tierney 2018-01-10 13:38:09 UTC
For the result type, we already have that situation with length. There are arguments both ways.

On motivation, that's pretty ancient history going back to S. My guess: to make it easier to talk to FORTRAN.
Comment 8 Martin Maechler 2018-01-11 08:48:10 UTC
(In reply to Alex Bertram from comment #5)
> Thanks Martin for catching the incorrect error message!
> 
> Regarding the warning message, I think sum(numeric(.)) is still helpful even
> if the return type is complex: sum(2+3i, as.numeric(iL)) will fix the problem
> and sum(2+3i, as.complex(iL)) seems exaggerated.

hmm, I think you may well be right here.. so even more reason to not change tha.
Comment 9 Martin Maechler 2018-01-11 08:55:49 UTC
(In reply to Luke Tierney from comment #4)
> If you work on this keep in mind we also have this issue reported by Herve
> Pages:
> 
> > x <- numeric(2^31) ## a long vector
> > sum(x == 0)
> [1] NA
> Warning message:
> In sum(x == 0) : integer overflow - use sum(as.numeric(.))
> 
> We'll probably have to allow sum to return a double when it needs to
> overflow.
> I'm not suggesting this be addressed now, just keep in mind as you change
> things.

Thank you, Luke, for remembering this related problem.
I agree that this should be addressed as well, but think it should be another PR.

The changes I plan to commit, Alex' + the error-msg fix (Comment #3) + cosmetic, will not hinder or complicate addressing  Herve's long-vector case .. which I agree should be addressed.

... I think I'd rather commit them now, including regression tests.
Apropos regr.test:  If the long vector case would be addressed as radically as Alex had proposed, it would mostprobably entail several changes in some of our regression checks..
Comment 10 Martin Maechler 2018-01-12 15:14:39 UTC
Patch committed to  R-devel 74107.     Planned to be ported to R 3.4.3 patched.