Bug 16757 - Wishlist: Protect against stack overflow from PCRE
Summary: Wishlist: Protect against stack overflow from PCRE
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist (show other bugs)
Version: R-devel (trunk)
Hardware: x86_64/x64/amd64 (64-bit) Linux
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-03-10 15:19 UTC by Mikko Korpela
Modified: 2016-03-11 08:09 UTC (History)
0 users

See Also:


Attachments
Suggested patch (3.45 KB, patch)
2016-03-10 15:19 UTC, Mikko Korpela
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mikko Korpela 2016-03-10 15:19:16 UTC
Created attachment 2040 [details]
Suggested patch

When using the PCRE library through R, a stack overflow is possible with a "suitable" pattern and subject string. The useR can probably avoid most cases by using a carefully designed pattern or short strings. However, it would be nice if R tried to control PCRE so that these stack overflows were less likely to happen.

The stack usage of PCRE is described in <http://pcre.org/original/doc/html/pcrestack.html>.

Example run on Linux, R-devel r70303, PCRE 8.38, Cstack_info() reporting a stack size of 7969177 bytes. The problem also exists on R 3.2.3.

Code:
    make_a_string <- function(n) {
        paste0(rep("a", n), collapse="")
    }
    pattern <- "(a|b)+"
    first_n <- 8422 # system dependent
    
    for (n in seq(first_n, first_n + 9)) {
        print(n)
        x <- make_a_string(n)
        print(grepl(pattern, x, perl = TRUE))
    }

Output:
[1] 8422
[1] TRUE
[1] 8423
[1] TRUE
[1] 8424
[1] TRUE
[1] 8425
[1] TRUE
[1] 8426
Error: segfault from C stack overflow


The attached patch tries to compute how much space is roughly available on the stack (similarly to do_Cstack_info() in src/main/platform.c) and sets a limit to the number of recursive calls PCRE is allowed to make. The code estimates the stack space occupied by each recursive call to be 600 bytes, but this is really system dependent. There is a 500 byte estimate in the PCRE manual (link above), so this has some safety margin compared to that. Of course the recursive call may actually use more space which may deplete the stack. In other cases the estimate is too high, and a pattern may fail to match whereas the current implementation would have produced a match. In my experiments with PCRE 8.38 on Linux and OS X 10.7.5, the stack usage reported by 'pcretest -m -C' (as suggested by the manual) seems to be unrealistically low.

The patch applies a recursion limit to all cases where pcre_exec() is used in src/main/grep.c. The patch also fixes one case of what looks like incorrect nesting level. If PCRE has been configured to use the heap instead of the stack, no limits are set. Different memory management considerations apply when JIT is available in PCRE and used. However, PCRE JIT is never used by R: <https://stat.ethz.ch/pipermail/r-devel/2015-November/072071.html>.

Example with the patch applied, on the same Linux machine. This does not segfault, but silently fails to match (R does not use the error code of pcre_exec). Clearly the estimated 600 bytes is more than actually used by each recursion on this system, as the longest matching subject string is shorter than without the patch.

Code:
    make_a_string <- function(n) {
        paste0(rep("a", n), collapse="")
    }
    pattern <- "(a|b)+"
    first_n <- 6622 # system dependent

    for (n in c(seq(first_n, first_n + 8), 1e6)) {
        print(n)
        x <- make_a_string(n)
        print(grepl(pattern, x, perl = TRUE))
    }

Output:
[1] 6622
[1] TRUE
[1] 6623
[1] TRUE
[1] 6624
[1] TRUE
[1] 6625
[1] TRUE
[1] 6626
[1] TRUE
[1] 6627
[1] FALSE
[1] 6628
[1] FALSE
[1] 6629
[1] FALSE
[1] 6630
[1] FALSE
[1] 1e+06
[1] FALSE