Bug 14627

Summary: agrep crashes with long pattern/target
Product: R Reporter: Duncan Murdoch <murdoch>
Component: Low-levelAssignee: R-core <R-core>
Status: CLOSED FIXED    
Severity: major CC: simon.urbanek
Priority: P5    
Version: R-devel (trunk)   
Hardware: All   
OS: Mac OS X v10.6   

Description Duncan Murdoch 2011-07-13 13:27:29 UTC
This affects multiple versions including R-devel.

Setting the pattern and targets to very long strings, e.g.

x <- pattern <- paste(rep("a ", 600), collapse="")

causes agrep() to fail.  Simply doing

agrep(pattern, x)

finds no match even though pattern == x, and

agrep(pattern, x, max.distance=0.5)

causes a segfault in a tre function, following this error reported by valgrind:

> agrep(pattern, x, max.distance=0.5)
==25522== Invalid write of size 4
==25522==    at 0x5072431: tre_tnfa_run_approx (tre-match-approx.c:822)
==25522==    by 0xC49E7CF: ???
==25522==    by 0xC49E85F: ???
(many more similar lines deleted)
Comment 1 Simon Urbanek 2011-07-13 20:45:33 UTC
Few nasty bugfixes later we should be there (R-devel). This triggered several bugs - one was stack trampling in tre-match-approx the other was stack overflow due to recursion in tre_ast_to_tnfa. Finally in fixing the former has triggered a bug in the definition of R_assert used in TRE.

The current fix is to raise an R error due to TRE assertion violation. A proper cure would be to allow growing buffer instead of the fixed ring buffer in tre_tnfa_run_approx() and I'll probably have a shot at it later.
Comment 2 Simon Urbanek 2011-07-13 21:23:49 UTC
It should be now completely fixed:

> x <- pattern <- paste(rep("a ", 600), collapse="")
> agrep(pattern, x)
[1] 1
> agrep(pattern, x, max.distance=0.5)
[1] 1
> 

tre_tnfa_run_approx() now uses a dynamically growing ring buffer (was fixed to 512 before).
Some of these fixes have speed implications because there may be additional buffer allocations, but they are probably negligible (the ring buffer switches from stack to heap when needed, but there is no equivalent switch from recursion to iteration).

Note, however, that TRE does impose additional limits that are arbitrary, for example:
> x <- pattern <- paste(rep("a ", 6000), collapse="")
> agrep(pattern, x)
Error in agrep(pattern, x) : regcomp error:  'Out of memory'