Created attachment 1179
patch for src/main/grep.c that implements named capture in g?regexpr using PCRE
I wrote the following to r-devel but so far have had no response, so I thought I would try this feature request here.
One feature from Python that I have been wanting in R is the ability
to capture groups in regular expressions using names. Consider the
following example in R.
> notables <- c("  Ben Franklin and Jefferson Davis","\tMillard Fillmore")
> name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
> (parsed <- regexpr(name.rex,notables,perl=TRUE))
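[1] 3 2
attr(,"match.length")
[1] 12 16
attr(,"capture.start")
     [,1] [,2]
[1,]    3    7
[2,]    2   10
attr(,"capture.length")
     [,1] [,2]
[1,]    3    8
[2,]    7    8
attr(,"capture.names")
[1] "first" "last"
[1,] "Ben"     "Franklin"
[2,] "Millard" "Fillmore"
[1] "Franklin" "Fillmore"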
The advantage to this approach is that you can tag groups by name, and
then use the names later in the code to extract the matched substrings.
I realized this is possible by using the PCRE library which ships with
R, so in the last couple days I hacked a bit in src/main/grep.c in the
R source code. I managed to get named capture to work with the
standard gregexpr and regexpr functions. For backwards-compatibility,
my strategy was to just add more attributes to the results of these
functions, as shown above.
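As a rough illustration of how the extra attributes can be consumed, here is a minimal sketch. It assumes an R build in which regexpr(..., perl=TRUE) returns the attributes capture.start, capture.length and capture.names (current R releases do); capture_matrix is a hypothetical helper name chosen for this example and is not part of the patch.

```r
# Sketch: turn the capture attributes of a regexpr() result into a
# character matrix with one column per named group.
# `capture_matrix` is a hypothetical helper, not part of the patch.
capture_matrix <- function(x, m) {
  start <- attr(m, "capture.start")
  len   <- attr(m, "capture.length")
  # substring() recycles x down the columns of the start/length matrices;
  # unmatched groups (start = -1, length = -1) come out as "".
  matrix(substring(x, start, start + len - 1L),
         nrow = length(x),
         dimnames = list(NULL, attr(m, "capture.names")))
}

notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
parsed <- regexpr(name.rex, notables, perl = TRUE)

groups <- capture_matrix(notables, parsed)
groups[, "last"]   # select a group by its name rather than its position
```

Because the match positions themselves are unchanged and the capture data travel as attributes, code that only uses the plain regexpr result keeps working.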
Attached is the patch and some R code for testing the new features. It
works fine for me with no memory problems. However, I noticed that
there is some UTF8 handling code, which I did not touch (use_UTF8 is
false on my machine). I presume we will need to make some small
modifications to get it to work with unicode, but I'm not sure how to do that.
Would you consider integrating this patch into the R source code for
future releases, so the larger R community can take advantage of this
feature? If there's anything else I can do to help please let me know.
Created attachment 1180
R code to test the new features
Please do note the FAQ on how to mark wishes.
We won't consider this until it has been tested in a UTF-8 locale.
Created attachment 1183
patch for named capture with unicode support
Created attachment 1184
R code to test the new features (including unicode support)
I have made additional changes to grep.c and tested it on Japanese characters to make sure unicode processing works as expected. Would you please consider this new patch for inclusion in the R source?
> jpstr <- c("ユーティーエフはち、ユーティーエフエイト","ユーティーエフはち","english words")
> l <- gregexpr("エ(?<two>..)(?<rest>[^エ]*)",jpstr,perl=TRUE)
[1,] "フは" "ち、ユーティー"
[2,] "フエ" "イト"
[1,] "フは" "ち"
> (parsed <- regexpr("エ(?<two>..)(?<rest>[^エ]*)",jpstr,perl=TRUE))
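[1]  6  6 -1
attr(,"match.length")
[1] 10  4 -1
attr(,"capture.start")
     [,1] [,2]
[1,]    7    9
[2,]    7    9
[3,]   -1   -1
attr(,"capture.length")
     [,1] [,2]
[1,]    2    7
[2,]    2    1
[3,]   -1   -1
attr(,"capture.names")
[1] "two"  "rest"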
[1,] "フは" "ち、ユーティー"
[2,] "フは" "ち"
[3,] "" ""
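The positions reported above are character offsets, not byte offsets, which is what makes them usable with substring() on UTF-8 text. A small sketch of that check follows. It assumes an R build where regexpr(..., perl=TRUE) exposes capture.start and capture.length, and it writes the second test string with \u escapes so that R marks it as UTF-8 whatever the locale; the variable names are illustrative only.

```r
# "ユーティーエフはち", spelled with \u escapes so it is marked as UTF-8
jp  <- "\u30e6\u30fc\u30c6\u30a3\u30fc\u30a8\u30d5\u306f\u3061"
# Same pattern as in the report: エ, then two characters, then the rest up to the next エ
pat <- "\u30a8(?<two>..)(?<rest>[^\u30a8]*)"

m     <- regexpr(pat, jp, perl = TRUE)
start <- attr(m, "capture.start")
len   <- attr(m, "capture.length")

# Character-based positions can be passed straight to substring()
pieces <- substring(jp, start, start + len - 1L)

# For comparison: the same match start expressed as a byte offset
char_pos <- as.integer(m)
byte_pos <- nchar(substr(jp, 1L, char_pos - 1L), type = "bytes") + 1L
```

Each kana here takes three bytes in UTF-8, so the two offsets differ; offsets counted in bytes would cut through the middle of a character when fed to substring().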
This is some of the least legible C code I have ever seen! Please use
spaces and follow indentation standards.
'make check' fails when the patch is applied: that really is a minimal
test before submission.
I at least simply don't have the resources to spend debugging patches
which fail basic testing.
Created attachment 1186
patch for named capture with unicode support (no errors on make check)
I am very sorry about the legibility problems of my previous patch, and the fact that it caused an error in 'make check'. This is the first time I have submitted a patch and I was not aware of the standard protocol for writing and checking the R source code. However, for this new patch, I have tried to remedy these problems:
- I fixed the error that was coming up in 'make check'.
- The differences reported by 'make check' in base-Ex.Rout and internet.Rout are there before and after applying my patch, which clearly implies that my patch is not causing them.
- I read the "R Coding Standards" part of the "R Internals" manual, and I used the indentation that it recommends.
- I read the "Testing R code" part of the "R Internals" manual. I did 'cd tests;make no-segfault.Rout' and observed no segfaults.
Please consider this updated patch for inclusion in the next release of R.
Not for 2.13.x, at least.
To follow the style of the submitted code: