Bug 16354 - utils::getParseData() fails on expressions containing long character strings
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.2.0
Hardware: x86_64/x64/amd64 (64-bit) Linux
Importance: P5 enhancement
Assignee: R-core
Reported: 2015-04-29 23:11 UTC by Yihui Xie
Modified: 2015-05-02 13:10 UTC
Attachments
A sample R script to reproduce the problem (1.37 KB, text/x-r-source)
2015-04-29 23:11 UTC, Yihui Xie
test case (1.11 KB, text/plain)
2015-04-30 16:53 UTC, Benjamin Tyner

Description Yihui Xie 2015-04-29 23:11:35 UTC
Created attachment 1811
A sample R script to reproduce the problem

Please see the attached R script, which makes getParseData() fail. The character string assigned to `odelString` is very long, and its STR_CONST token appears to be dropped from the parse data entirely:

> getParseData(parse('long.r',keep.source = TRUE))
  line1 col1 line2 col2 id parent     token terminal       text
1     1    1     1   10  1      3    SYMBOL     TRUE odelString
3     1    1     1   10  3      0      expr    FALSE           
2     1   11     1   11  2      0 EQ_ASSIGN     TRUE          =
6     1   12    40    1  6      0      expr    FALSE           

> # by comparison, a shorter string works
> getParseData(parse(text = 'odelString = "A short string"', keep.source = TRUE))
  line1 col1 line2 col2 id parent     token terminal             text
1     1    1     1   10  1      3    SYMBOL     TRUE       odelString
3     1    1     1   10  3      0      expr    FALSE                 
2     1   12     1   12  2      0 EQ_ASSIGN     TRUE                =
4     1   14     1   29  4      6 STR_CONST     TRUE "A short string"
6     1   14     1   29  6      0      expr    FALSE                 

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
Comment 1 Benjamin Tyner 2015-04-30 16:53:45 UTC
Created attachment 1815
test case

# code to use the test case:

b <- parse(file = "~/badstring.R", keep.source = TRUE)
d <- getParseData(b, includeText = FALSE)
subset(d, line1 == 2L) # gives:
#    line1 col1 line2 col2 id parent token terminal       
# 10     2    5    24    1 10     21  expr    FALSE
subset(d, parent == 10) # gives:
# [1] line1    col1     line2    col2     id       parent   token    terminal
# <0 rows> (or 0-length row.names)
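
The check above (an `expr` row with no child rows) can be generalized without the attachment. The following is only a sketch: `find_childless()` is a hypothetical helper name, not part of utils, and it assumes that every non-terminal row in healthy parse data is the parent of at least one other row.

```r
# Hypothetical helper: return non-terminal rows that no other row names as
# its parent. Per the symptom shown above, a lost string token leaves its
# enclosing `expr` row in this state.
find_childless <- function(pd) {
  nonterminal <- pd[!pd$terminal, ]
  nonterminal[!(nonterminal$id %in% pd$parent), ]
}

# A short string parses cleanly, so no childless rows are expected here.
pd <- getParseData(parse(text = 'x = "short"', keep.source = TRUE),
                   includeText = FALSE)
childless <- find_childless(pd)
print(nrow(childless))
```

Running the same helper on the parse data of the attached test case should flag the row with id 10 on an affected R version.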
Comment 2 Duncan Murdoch 2015-05-02 13:10:20 UTC
I've found this and will partially fix it.  The issue is that the source text is stored in a buffer with a maximum length of 1000 characters.  When that was exceeded, the text wasn't added to the parse data, and accidentally, the whole record for that token was lost.

I'd rather not make that buffer bigger; instead I'll just replace the text with a message like this:

[1390 chars quoted with '"']

or

[1390 wide chars quoted with '"']

Note that 1390 is the count of characters after processing the escapes, not the number of characters in the original source file; the original text is gone by this point.

I'll commit this fix to R-devel and R-patched after some testing.
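
For anyone checking an R build that includes this fix, a minimal sketch along the following lines may help. It builds a 1390-character string in code rather than using the attachments, looks for the placeholder in the `text` column, and then asks getParseText() for the token's text. The variable names and the regular expression are illustrative assumptions based on the message format quoted above, and the sketch assumes getParseText() can read the string back from the srcfile kept by keep.source = TRUE.

```r
# Build source code containing a string well over the 1000-character buffer.
long_value <- paste(rep("a", 1390), collapse = "")
code <- sprintf('odelString = "%s"', long_value)

pd <- getParseData(parse(text = code, keep.source = TRUE))

# With the fix, the STR_CONST row should be present again.
str_rows <- pd[pd$token == "STR_CONST", ]

# Assumed pattern for the placeholder text described in this comment.
is_placeholder <- grepl("^\\[[0-9]+ (wide )?chars quoted with", str_rows$text)
print(is_placeholder)

# getParseText() is asked for the token text; for long strings it is
# expected to fall back to the retained source lines.
full_text <- getParseText(pd, str_rows$id)
print(nchar(full_text))
```

On R 3.2.0 itself, `str_rows` would be expected to have zero rows, which is the behavior reported in this bug.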