Bug 14518 - Wishlist: named capture in regular expressions
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Low-level
Version: R-devel (trunk)
Hardware: All All
Importance: P5 enhancement
Assigned To: R-core
Depends on:
Blocks:
Reported: 2011-03-03 17:20 UTC by Toby Dylan Hocking
Modified: 2011-03-24 10:36 UTC (History)

Attachments
patch for src/main/grep.c that implements named capture in g?regexpr using PCRE (8.77 KB, patch)
2011-03-03 17:20 UTC, Toby Dylan Hocking
R code to test the new features (1.91 KB, text/plain)
2011-03-03 17:21 UTC, Toby Dylan Hocking
patch for named capture with unicode support (11.09 KB, patch)
2011-03-10 23:36 UTC, Toby Dylan Hocking
R code to test the new features (including unicode support) (2.84 KB, application/octet-stream)
2011-03-10 23:37 UTC, Toby Dylan Hocking
patch for named capture with unicode support (no errors on make check) (11.41 KB, patch)
2011-03-16 13:30 UTC, Toby Dylan Hocking

Description Toby Dylan Hocking 2011-03-03 17:20:10 UTC
Created attachment 1179 [details]
patch for src/main/grep.c that implements named capture in g?regexpr using PCRE

I posted the following to r-devel but have had no response so far, so I am filing it here as a feature request.

One feature from Python that I have been wanting in R is the ability
to capture groups in regular expressions using names. Consider the
following example in R.

> notables <- c("  Ben Franklin and Jefferson Davis","\tMillard Fillmore")
> name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
> (parsed <- regexpr(name.rex,notables,perl=TRUE))
[1] 3 2
attr(,"match.length")
[1] 12 16
attr(,"capture.start")
     [,1] [,2]
[1,]    3    7
[2,]    2   10
attr(,"capture.length")
     [,1] [,2]
[1,]    3    8
[2,]    7    8
attr(,"capture.names")
[1] "first" "last" 
> parse.one(notables,parsed)
     first     last      
[1,] "Ben"     "Franklin"
[2,] "Millard" "Fillmore"
> parse.one(notables,parsed)[,"last"]
[1] "Franklin" "Fillmore"

The advantage to this approach is that you can tag groups by name, and
then use the names later in the code to extract the matched substrings.

I realized this is possible by using the PCRE library which ships with
R, so in the last couple days I hacked a bit in src/main/grep.c in the
R source code. I managed to get named capture to work with the
standard gregexpr and regexpr functions. For backwards-compatibility,
my strategy was to just add more attributes to the results of these
functions, as shown above.
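
The helper parse.one used above is not defined in the text of this report. As a minimal sketch only, assuming the "capture.start", "capture.length" and "capture.names" attributes are laid out as in the output above (one row per subject string, one column per group), such a helper might look like this; the body is an illustration, not the code from the attachment:

```r
# Hypothetical sketch of parse.one: turn the capture attributes returned
# by regexpr(..., perl=TRUE) into a character matrix of matched substrings.
parse.one <- function(subject, match) {
    starts <- attr(match, "capture.start")
    lens <- attr(match, "capture.length")
    # substring() recycles 'subject' down the columns of 'starts' and
    # returns a plain vector, so the matrix shape is restored afterwards.
    # A non-match (start -1, length -1) yields the empty string.
    out <- matrix(substring(subject, starts, starts + lens - 1L),
                  nrow = length(subject))
    colnames(out) <- attr(match, "capture.names")
    out
}

notables <- c("  Ben Franklin and Jefferson Davis", "\tMillard Fillmore")
name.rex <- "(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)"
parsed <- regexpr(name.rex, notables, perl = TRUE)
parse.one(notables, parsed)[, "last"]
```

Because the groups are addressed by name, reordering them in the pattern does not require changing the code that consumes the result.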

Attached is the patch and some R code for testing the new features. It
works fine for me with no memory problems. However, I noticed that
there is some UTF-8 handling code, which I did not touch (use_UTF8 is
false on my machine). I presume some small modifications will be
needed to get it to work with Unicode, but I'm not sure how to make
them.

Would you consider integrating this patch into the R source code for
future releases, so the larger R community can take advantage of this
feature? If there's anything else I can do to help please let me know.
Comment 1 Toby Dylan Hocking 2011-03-03 17:21:41 UTC
Created attachment 1180 [details]
R code to test the new features
Comment 2 Brian Ripley 2011-03-03 22:14:48 UTC
Please do note the FAQ on how to mark wishes.

We won't consider this until it has been tested in a UTF-8 locale.
Comment 3 Toby Dylan Hocking 2011-03-10 23:36:22 UTC
Created attachment 1183 [details]
patch for named capture with unicode support
Comment 4 Toby Dylan Hocking 2011-03-10 23:37:04 UTC
Created attachment 1184 [details]
R code to test the new features (including unicode support)
Comment 5 Toby Dylan Hocking 2011-03-10 23:41:12 UTC
I have made additional changes to grep.c and tested the patch on Japanese characters to make sure Unicode processing works as expected. Would you please consider this new patch for inclusion in the R source?

> jpstr <- c("ユーティーエフはち、ユーティーエフエイト","ユーティーエフはち","english words")
> l <- gregexpr("エ(?<two>..)(?<rest>[^エ]*)",jpstr,perl=TRUE)
> result2list(jpstr,l)
[[1]]
     two    rest            
[1,] "フは" "ち、ユーティー"
[2,] "フエ" "イト"          

[[2]]
     two    rest
[1,] "フは" "ち"

[[3]]
NULL

> (parsed <- regexpr("エ(?<two>..)(?<rest>[^エ]*)",jpstr,perl=TRUE))
[1]  6  6 -1
attr(,"match.length")
[1] 10  4 -1
attr(,"capture.start")
     [,1] [,2]
[1,]    7    9
[2,]    7    9
[3,]   -1   -1
attr(,"capture.length")
     [,1] [,2]
[1,]    2    7
[2,]    2    1
[3,]   -1   -1
attr(,"capture.names")
[1] "two"  "rest"
> parse.one(jpstr,parsed)
     two    rest            
[1,] "フは" "ち、ユーティー"
[2,] "フは" "ち"            
[3,] ""     ""
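
The helper result2list is likewise not defined in this comment. A rough sketch, assuming each element of the gregexpr result carries the same capture attributes with one row per match, and that a subject with no match should map to NULL as in the output above; the function body and the ASCII test data are illustrative assumptions, not the attached code:

```r
# Hypothetical sketch of result2list: one character matrix per subject
# string, with one row per match and one named column per capture group.
result2list <- function(subject, matches) {
    lapply(seq_along(subject), function(i) {
        m <- matches[[i]]
        if (m[1] == -1) return(NULL)  # no match in this subject
        starts <- attr(m, "capture.start")
        lens <- attr(m, "capture.length")
        out <- matrix(substring(subject[i], starts, starts + lens - 1L),
                      nrow = nrow(starts))
        colnames(out) <- attr(m, "capture.names")
        out
    })
}

subject <- c("a1b22", "zzz")
found <- gregexpr("(?<letter>[a-z])(?<digits>[0-9]+)", subject, perl = TRUE)
groups <- result2list(subject, found)
groups[[1]][, "digits"]
```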
Comment 6 Brian Ripley 2011-03-15 12:18:02 UTC
This is some of the least legible C code I have ever seen!  Please use
spaces and follow indentation standards.

'make check' fails when the patch is applied: that really is a minimal
test before submission.

I at least simply don't have the resources to spend debugging patches
which fail basic testing.
Comment 7 Toby Dylan Hocking 2011-03-16 13:30:14 UTC
Created attachment 1186 [details]
patch for named capture with unicode support (no errors on make check)

I am very sorry about the legibility problems of my previous patch, and the fact that it caused an error in 'make check'. This is the first time I have submitted a patch and I was not aware of the standard protocol for writing and checking the R source code. However, for this new patch, I have tried to remedy these problems:

- I fixed the error that was coming up in 'make check'.

- The differences reported by 'make check' in base-Ex.Rout and internet.Rout are there before and after applying my patch, which clearly implies that my patch is not causing them.

- I read the "R Coding Standards" part of the "R Internals" manual, and I used the indentation that it recommends.

- I read the "Testing R code" part of the "R Internals" manual. I did 'cd tests;make no-segfault.Rout' and observed no segfaults.

Please consider this updated patch for inclusion in the next release of R.
Comment 8 Brian Ripley 2011-03-21 13:22:25 UTC
Not for 2.13.x, at least.
Comment 9 Brian Ripley 2011-03-24 10:36:56 UTC
To follow the style of the submitted code:
'Acorrectedversionhasbeenputinthetrunk(2.14.0tobe).'