Bug 15976 - Incorrect parsing of scientific notation
Summary: Incorrect parsing of scientific notation
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Low-level
Version: R 3.1.0
Hardware: All All
Importance: P5 minor
Assignee: R-core
Reported: 2014-09-13 21:15 UTC by Siddhartha Bagaria
Modified: 2014-11-05 21:08 UTC

Description Siddhartha Bagaria 2014-09-13 21:15:35 UTC
Consider these two statements and their results:

> is.nan(0E4932)
[1] FALSE
> is.nan(0E4933)
[1] TRUE

Both of these numbers are understood by gcc, which correctly parses each of them as 0. However, when R constructs the SEXPREC structure for the number, it stores 0E4933 as 0xfff8000000000000, the bit pattern of a NaN. I ran a debugging session with gdb and put a breakpoint on do_isnan. When I called is.nan(0E4933), this is what I found:

Breakpoint 1, do_isnan (call=0x2345ae0, op=0x118c680, args=0x2345a00, rho=0x119f640) at coerce.c:2220
2220    coerce.c: No such file or directory.
(gdb) p *(REAL(CAR(args)))
$3 = -nan(0x8000000000000)
(gdb) p *(CAR(args))
$4 = {sxpinfo = {type = 14, obj = 0, named = 2, gp = 0, mark = 0, debug = 0, trace = 0, spare = 0, gcgen = 0, gccls = 1}, attrib = 0x116c9b8,
  gengc_next_node = 0x23ac648, gengc_prev_node = 0x23ac6a8, u = {primsxp = {offset = 1}, symsxp = {pname = 0x1, value = 0xfff8000000000000,
      internal = 0x2000000a}, listsxp = {carval = 0x1, cdrval = 0xfff8000000000000, tagval = 0x2000000a}, envsxp = {frame = 0x1,
      enclos = 0xfff8000000000000, hashtab = 0x2000000a}, closxp = {formals = 0x1, body = 0xfff8000000000000, env = 0x2000000a}, promsxp = {value = 0x1,
      expr = 0xfff8000000000000, env = 0x2000000a}}}
(gdb) p *(args)
$5 = {sxpinfo = {type = 2, obj = 0, named = 0, gp = 0, mark = 0, debug = 0, trace = 0, spare = 0, gcgen = 0, gccls = 0}, attrib = 0x116c9b8,
  gengc_next_node = 0x23459c8, gengc_prev_node = 0x2345a38, u = {primsxp = {offset = 37406328}, symsxp = {pname = 0x23ac678, value = 0x116c9b8,
      internal = 0x116c9b8}, listsxp = {carval = 0x23ac678, cdrval = 0x116c9b8, tagval = 0x116c9b8}, envsxp = {frame = 0x23ac678, enclos = 0x116c9b8,
      hashtab = 0x116c9b8}, closxp = {formals = 0x23ac678, body = 0x116c9b8, env = 0x116c9b8}, promsxp = {value = 0x23ac678, expr = 0x116c9b8,
      env = 0x116c9b8}}}


My session info:
R Under development (unstable) (2014-05-28 r65789)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.2.0
Comment 1 Peter Dalgaard 2014-09-13 22:16:18 UTC
Also,

> as.numeric("0E4933")
[1] NaN

and the whole thing boils down to 0*Inf==NaN

This is happening in R_strtod5 in src/main/util.c, where we have things like

        for (n = expn, fac = 1.0; n; n >>= 1, p10 *= p10)
            if (n & 1) fac *= p10;
        ans *= fac;

and fac can overflow LDOUBLE. Presumably it wouldn't be a big deal just to return 0 if ans==0.
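
The suggestion above can be sketched in C roughly as follows. This is only an illustration, not R's actual code: the function names scale_unguarded and scale_guarded are hypothetical, long double stands in for LDOUBLE, and the sketch assumes a long double whose range ends below 10^4933 (as on x86-64 Linux). Only the loop itself is taken from the snippet quoted above.

```c
#include <math.h>

/* Hypothetical helper, not R's code: multiply ans by 10^expn (expn >= 0)
 * using the square-and-multiply loop quoted above. long double is an
 * assumed stand-in for R's LDOUBLE. */
long double scale_unguarded(long double ans, int expn)
{
    long double p10 = 10.0L, fac = 1.0L;
    int n;

    for (n = expn, fac = 1.0L; n; n >>= 1, p10 *= p10)
        if (n & 1) fac *= p10;   /* fac overflows to Inf for large expn */
    return ans * fac;            /* 0 * Inf yields NaN */
}

/* Same computation with the proposed special case: a zero mantissa
 * stays zero whatever the exponent is. */
long double scale_guarded(long double ans, int expn)
{
    if (ans == 0.0L) return ans; /* skip the scaling entirely */
    return scale_unguarded(ans, expn);
}
```

With these definitions, scale_unguarded(0.0L, 4933) should come out as NaN, while scale_guarded(0.0L, 4933) stays 0; nonzero mantissas behave the same in both versions.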
Comment 2 Duncan Murdoch 2014-09-13 22:23:48 UTC
Your title seems right, but your demonstration is really irrelevant.  If you look at the number 0E4933 you'll see that it is a NaN.  Of course is.nan() returns true for it.

The issue is entirely in the constant parsing, not in the is.nan() test.  The parser evaluates that constant as 0 times 10^4933.  The latter overflows to Inf, so the result is NaN.  

We could fix this with a special case of a mantissa of 0, or we could document that xxxEeee is defined to be the same as xxx * 10^eee, in which case the current behaviour is correct.  Or we could do nothing.
Comment 3 Siddhartha Bagaria 2014-09-13 23:20:32 UTC
Yes, I agree my demonstration is indirect. What I omitted from my bug report is that the first thing I tried was a small C program to check the behavior of GCC, and it turns out that GCC parses the number correctly. I guess it must have the special condition that Peter points out in Comment #1. Basically, 0*Inf is NaN, but 0*(numerical overflow) can safely be said to be 0.

I just needed a breakpoint into the program, and since I did not know R_strtod5, the first thing that came to my mind was do_isnan. I just wanted to look into the SEXPREC object constructed for this parsed number, and meanwhile also check that do_isnan is not doing anything funny (it is not).

Thank you for the replies. It is very encouraging. I am happy to send a patch for R_strtod5.

====

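#include <math.h>
#include <stdio.h>

/* The compiler parses this constant at compile time. */
static double ZERO = 0E4933;

int main(void) {
  /* isnan() returns nonzero only if its argument is a NaN. */
  printf("Value: %f, isnan: %d\n", ZERO, isnan(ZERO));
  return 0;
}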

====

This outputs:
Value: 0.000000, isnan: 0

====
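
The program above tests the compiler's handling of the literal. A run-time counterpart, closer to as.numeric("0E4933") in Comment 1, would go through the C library's strtod instead. The following is a minimal sketch under that assumption; the helper name parses_as_nan is hypothetical and is not part of R or of the C library.

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the C library's strtod() converts
 * text to a NaN, and 0 otherwise. */
int parses_as_nan(const char *text)
{
    double value = strtod(text, NULL);  /* standard C string-to-double */
    return isnan(value) ? 1 : 0;
}
```

A conforming strtod is expected to convert "0E4933" to zero, so parses_as_nan("0E4933") should return 0, matching what the compiler does with the literal.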
Comment 4 Duncan Murdoch 2014-11-05 21:08:42 UTC
I'll fix this in R-devel and R-patched.