Bug 14627 - agrep crashes with long pattern/target
agrep crashes with long pattern/target
Status: RESOLVED FIXED
Product: R
Classification: Unclassified
Component: Low-level
R-devel (trunk)
All Mac OS X v10.6
: P5 major
Assigned To: R-core
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-07-13 13:27 UTC by Duncan Murdoch
Modified: 2011-07-13 21:23 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Duncan Murdoch 2011-07-13 13:27:29 UTC
This affects multiple versions including R-devel.

Setting the pattern and targets to very long strings, e.g.

x <- pattern <- paste(rep("a ", 600), collapse="")

causes agrep() to fail.  Simply doing

agrep(pattern, x)

finds no match even though pattern == x, and

agrep(pattern, x, max.distance=0.5)

causes a segfault in a tre function, following this error reported by valgrind:

> agrep(pattern, x, max.distance=0.5)
==25522== Invalid write of size 4
==25522==    at 0x5072431: tre_tnfa_run_approx (tre-match-approx.c:822)
==25522==    by 0xC49E7CF: ???
==25522==    by 0xC49E85F: ???
(many more similar lines deleted)
Comment 1 Simon Urbanek 2011-07-13 20:45:33 UTC
Few nasty bugfixes later we should be there (R-devel). This triggered several bugs - one was stack trampling in tre-match-approx the other was stack overflow due to recursion in tre_ast_to_tnfa. Finally in fixing the former has triggered a bug in the definition of R_assert used in TRE.

The current fix is to raise an R error due to TRE assertion violation. A proper cure would be to allow growing buffer instead of the fixed ring buffer in tre_tnfa_run_approx() and I'll probably have a shot at it later.
Comment 2 Simon Urbanek 2011-07-13 21:23:49 UTC
It should be now completely fixed:

> x <- pattern <- paste(rep("a ", 600), collapse="")
> agrep(pattern, x)
[1] 1
> agrep(pattern, x, max.distance=0.5)
[1] 1
> 

tre_tnfa_run_approx() now uses a dynamically growing ring buffer (was fixed to 512 before).
Some of these fixes have speed implications because there may be additional buffer allocations, but they are probably negligible (the ring buffer switches from stack to heap when needed, but there is no equivalent switch from recursion to iteration).

Note, however, that TRE does impose additional limits that are arbitrary, for example:
> x <- pattern <- paste(rep("a ", 6000), collapse="")
> agrep(pattern, x)
Error in agrep(pattern, x) : regcomp error:  'Out of memory'