Bug 16009 - gsub() does not handle NA in character vectors
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.1.1
Hardware: All All
Importance: P5 normal
Assignee: R-core
Reported: 2014-10-04 02:38 UTC by Roland Seubert
Modified: 2014-10-08 17:41 UTC
Description Roland Seubert 2014-10-04 02:38:18 UTC
Using gsub() to collapse runs of two or more whitespace characters within each element of a character vector into a single space fails when the vector contains NA values:

> x <- c(NA, "  abc", "a b c    ", "a  b c")
> gsub("\\s{2,}", " ", x)
[1] NA  " " " " " "

However, it works fine if one uses Perl-like regular expressions:

> gsub("\\s{2,}", " ", x, perl = TRUE)
[1] NA       " abc"   "a b c " "a b c"

or if matching is done by byte:

> gsub("\\s{2,}", " ", x, useBytes = TRUE)
[1] NA       " abc"   "a b c " "a b c"
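
As a stopgap on affected builds, the perl = TRUE call above can be wrapped in a small helper. This is only a sketch: collapse_ws is a hypothetical name, not part of R, and it assumes that PCRE's definition of \s is acceptable for the data at hand.

```r
# Hypothetical helper (not part of R): collapse runs of two or more
# whitespace characters into one space, using the PCRE engine so that
# the default (TRE) code path is not involved.
collapse_ws <- function(x) {
  gsub("\\s{2,}", " ", x, perl = TRUE)
}

x <- c(NA, "  abc", "a b c    ", "a  b c")
print(collapse_ws(x))
```

NA elements pass through as NA, matching the perl = TRUE output shown above.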

This was initially posted on Stack Overflow, where two users reported in the comments on my question that the behaviour also occurs under Mac OS X (R 3.1.1) but not under Windows. I encountered it on Linux x86-64.

http://stackoverflow.com/questions/26174360/r-regex-issues-with-character-vectors-containing-nas
Comment 1 Duncan Murdoch 2014-10-05 11:31:49 UTC
I see the error; will fix in R-devel and R-patched after some testing.
Comment 2 Duncan Murdoch 2014-10-05 11:58:17 UTC
Oops, there appear to be two errors here, and I only know how to fix one of them.

The first error is that the NA in x caused the code to think there were non-ASCII characters, and gsub switched to UTF-8 handling.

The second error is that UTF-8 handling doesn't work.  You'll get a similar error in a UTF-8 locale if you put some non-ASCII character in place of the NA, e.g.

> x <- c("ä", "  abc", "a b c    ", "a  b c")
> gsub("\\s{2,}", " ", x)
[1] "ä" " " " " " "

This looks like a bug in the TRE library that we use for regular expression matching.  In the wide-char variant used for UTF-8 strings (and currently also for strings containing NA, but I can fix that), the regexp always matches the whole string if it matches anything.
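
The non-ASCII example in this comment can be turned into a small regression check. This is a sketch that assumes a build containing the fix; on an affected build the comparison is expected to be FALSE. The letter is written as \u00e4 so the snippet does not depend on the file's encoding, and `expected` and `res` are illustrative names.

```r
# One non-ASCII element forces the UTF-8 (wide-char) code path for the call.
x <- c("\u00e4", "  abc", "a b c    ", "a  b c")

# What collapsing whitespace runs should produce: only the runs change.
expected <- c("\u00e4", " abc", "a b c ", "a b c")

res <- gsub("\\s{2,}", " ", x)
print(identical(res, expected))
```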
Comment 3 Duncan Murdoch 2014-10-05 15:14:24 UTC
A few more examples where things go wrong (at least in a UTF-8 locale):

x <- "abcxyz123ä"
sub("[[:blank:]]{2,}", "", x)  # bad
sub("[ \t]{2,}", "", x)        # okay
sub("[[:alpha:]]{2,}", "", x)  # bad

In fact, every character class given as [:<name>:] gives the bad behaviour, whether it matches or not.
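
The "whether it matches or not" part lends itself to a quick check: for classes that match nothing in x, sub() should return x untouched. A minimal sketch, assuming a build with the fix; `classes`, `patterns` and `unchanged` are illustrative names, and only classes with no match in this particular string are used.

```r
x <- "abcxyz123\u00e4"   # same string as above, non-ASCII letter escaped

# None of these classes matches any character in x, so sub() should be a no-op.
classes  <- c("blank", "space", "punct")
patterns <- sprintf("[[:%s:]]{2,}", classes)

# TRUE for a pattern means sub() left x unchanged, as it should.
unchanged <- vapply(patterns,
                    function(p) identical(sub(p, "", x), x),
                    logical(1))
print(unchanged)
```

Any FALSE entry points at a class for which the default engine reports a match that is not there.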
Comment 4 Duncan Murdoch 2014-10-07 18:23:13 UTC
Ville Laurikari (author of the TRE library) confirmed the character class problem is a bug in TRE 0.8.0, but said he won't be able to get to it soon.  If anyone is familiar enough with it to track this down, I'm sure we'd all appreciate it.
Comment 5 Duncan Murdoch 2014-10-08 17:41:37 UTC
I've got it -- TRE was forgetting the character class sometimes.  I've got a fix in place, and have emailed it to Ville Laurikari.  I'll commit to R-devel and R-patched after some testing.