Bug 16745 - strsplit(perl=TRUE, pattern="[[:<:]]", ...) gives wrong result
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Misc
Version: R 3.2.3
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-03-03 03:05 UTC by Bill Dunlap
Modified: 2016-03-08 16:17 UTC

See Also:


Attachments
Patch to change how strsplit(perl=TRUE) works with zero length matches (2.90 KB, patch)
2016-03-04 17:07 UTC, Mikko Korpela
Updated patch (2.97 KB, patch)
2016-03-08 16:17 UTC, Mikko Korpela

Description Bill Dunlap 2016-03-03 03:05:24 UTC
The perl regex "[[:<:]]" makes a zero-length match at the beginning of a word ("[[:>:]]" means end-of-word).  It acts properly in gregexpr but not in strsplit:

 gregexpr("[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
  #[1]  1  6 11
  #attr(,"match.length")
  #[1] 0 0 0
  #attr(,"useBytes")
  #[1] TRUE
 strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
  # [1] "O"  "n"  "e"  ", " "t"  "w"  "o"  "; " "t"  "h"  "r"  "e"  "e"  "!" 
  # Expect c("One, ", "two; ", "three!"), breaks before chars 1, 6, and 11
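
For comparison, the expected result can be rebuilt from the gregexpr positions shown above. This is only a minimal sketch, not a fix for strsplit: split_before_matches is a hypothetical helper name, and it assumes a single input string and a pattern that produces only zero-length matches.

```r
# Hypothetical helper (not part of R): break one string before each
# zero-length match reported by gregexpr(perl = TRUE).
split_before_matches <- function(x, pattern) {
  pos <- as.integer(gregexpr(pattern, x, perl = TRUE)[[1]])
  if (pos[1] == -1L) return(x)           # no match: return the input unchanged
  starts <- unique(c(1L, pos))           # the first piece always starts at character 1
  ends <- c(starts[-1] - 1L, nchar(x))   # each piece ends just before the next start
  substring(x, starts, ends)
}

split_before_matches("One, two; three!", "[[:<:]]")
# [1] "One, "  "two; "  "three!"
```

Because the positions come from a single gregexpr call over the whole string, every match is evaluated with its full left context.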

strsplit does act as expected for the zero-length end-of-word pattern "[[:>:]]":

 gregexpr("[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
  #[1]  4  9 16
  #attr(,"match.length")
  #[1] 0 0 0
  #attr(,"useBytes")
  #[1] TRUE
 strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
  #[1] "One"     ", two"   "; three" "!"

Not all zero-length look-behind patterns show this problem.  E.g.,

 strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]
  #[1] "One,"    " two;"   " three!"
      
It may be that strsplit is not using the startoffset argument
to pcre_exec:

  pcre/pcre/doc/html/pcreapi.html 
    A non-zero starting offset is useful when searching for another match
    in the same subject by calling pcre_exec() again after a previous
    success. Setting startoffset differs from just passing over a
    shortened string and setting PCRE_NOTBOL in the case of a pattern that
    begins with any kind of lookbehind.

or it could be something else.
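
The shortened-string hypothesis can be explored from R without reading the C source. The sketch below is an assumption about how strsplit(perl=TRUE) might proceed, not its actual implementation: it searches with regexpr, drops everything up to the end of the match, and searches again in what remains. It also assumes that a zero-length match at the first position consumes one character. emulate_split is a hypothetical name.

```r
# Hypothetical emulation (an assumption, not R's real code): search,
# cut the string after the match, and search again in the remainder.
emulate_split <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    len <- attr(m, "match.length")
    if (start == -1L) {                  # no further match: keep the rest
      out <- c(out, x)
      break
    }
    if (start == 1L && len == 0L) {      # empty match at the front:
      out <- c(out, substr(x, 1, 1))     # assumed to emit one character
      x <- substr(x, 2, nchar(x))
    } else {                             # emit the text before the match,
      out <- c(out, substr(x, 1, start - 1L))
      x <- substr(x, start + len, nchar(x))   # then drop through the match
    }
  }
  out
}

emulate_split("One, two; three!", "[[:<:]]")
```

Under these assumptions, every shortened string begins at a position that looks like a start of subject, so "[[:<:]]" matches immediately inside a word, whereas "[[:>:]]" needs a word character to its left and does not.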
Comment 1 Bill Dunlap 2016-03-03 04:08:23 UTC
I noted that:
Not all zero-length look-behind patterns show this problem.  E.g.,

 strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]
  #[1] "One,"    " two;"   " three!"

However, if I expand that pattern to include the zero-length match at
the beginning of the string the problem appears again:

 strsplit(split="(?<=[[:punct:]])|^", "One, two; three!", perl=TRUE)[[1]]
  #[1] "O" "n" "e" "," " " "t" "w" "o" ";" " " "t" "h" "r" "e" "e" "!"
Comment 2 Mikko Korpela 2016-03-04 17:07:26 UTC
Created attachment 2036
Patch to change how strsplit(perl=TRUE) works with zero length matches

Indeed, strsplit(perl = TRUE) doesn't use the start offset. Dealing with zero-length matches looks quite tricky, and it is not clear to me what the "proper" behavior is. Anyway, here is a quick, _poorly tested_ patch that appears to work almost as the original poster expected; the one difference is a leading empty string for a match at the start of the input.

I emphasize that the patch was quite a quick job. User beware.

> strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
[1] ""       "One, "  "two; "  "three!"
> strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "One"     ", two"   "; three" "!"      

Tested on Linux, R-devel revision 70276 (PCRE 8.38).
Comment 3 Mikko Korpela 2016-03-08 16:17:39 UTC
Created attachment 2038
Updated patch

Here is another version of the patch with some problems fixed, maybe others introduced... Example output follows.

Original examples:
> strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
[1] ""       "One, "  "two; "  "three!"
> strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "One"     ", two"   "; three" "!"      

New examples:
> strsplit(split="[[:<:]]|t", "One, two; three!", perl=TRUE)[[1]]
[1] ""      "One, " ""      "wo; "  ""      "hree!"
> strsplit(split="[[:>:]]|t", "One, two; three!", perl=TRUE)[[1]]
[1] "One"  ", "   "wo"   "; "   "hree" "!"   

Also, with split pattern "^", the output differs considerably from the unpatched output.

Current implementation:
> strsplit("Foo", "^", perl=TRUE)[[1]]
[1] "F" "o" "o"

Patched version:
> strsplit("Foo", "^", perl=TRUE)[[1]]
[1] ""    "Foo"