Bug 16264 - Errors on Windows with grep(fixed=TRUE) on UTF-8 strings
Summary: Errors on Windows with grep(fixed=TRUE) on UTF-8 strings
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Low-level (show other bugs)
Version: R 3.1.2
Hardware: Other Linux
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-03-16 15:42 UTC by Winston Chang
Modified: 2015-09-22 09:10 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Winston Chang 2015-03-16 15:42:08 UTC
This is a followup to:
https://stat.ethz.ch/pipermail/r-devel/2015-March/070735.html
https://stat.ethz.ch/pipermail/r-devel/2015-March/070759.html


This bug happens with grep(), when:
* Running in a multibyte locale. This is common in Windows, though it can also happen on Mac or Linux.
* The search uses fixed=TRUE.
* The search pattern is a single byte.
* The current locale has a multibyte encoding.

=======================

Here's an example that demonstrates the bug in Windows:

# First, create a 3-byte UTF-8 character
x <- rawToChar(as.raw(c(0xe5, 0x8d, 0x88)))
Encoding(x) <- "UTF-8"
x
# [1] "午"

# In my default locale, grep with a single-char pattern and fixed=TRUE
# returns integer(0), as expected.
Sys.getlocale("LC_CTYPE")
# [1] "English_United States.1252"
grep("a", x, fixed = TRUE)
# integer(0)

# When the using a multibyte locale, grep with a single-char
# pattern and fixed=TRUE results in an error.
Sys.setlocale("LC_CTYPE", "chinese")
grep("a", x, fixed = TRUE)
# Error in grep("a", x, fixed = TRUE) : invalid multibyte string at '<97>'

=======================

This code can replicate the bug on a Mac:

x <- rawToChar(as.raw(c(0xe5, 0x8d, 0x88)))
Encoding(x) <- "UTF-8"
x
# [1] "午"

# With a UTF-8 locale, no error
Sys.setlocale("LC_ALL", "en_US.UTF-8")
gsub("x", "", x, fixed = TRUE)
# [1] "午"

# With a multibyte, non-UTF-8 locale, error
Sys.setlocale("LC_ALL", "ja_JP.SJIS")
gsub("x", "", x, fixed = TRUE)
# Error in gsub("#", "", x, fixed = TRUE) : 
#   invalid multibyte string at '<88>'

=======================

It's also possible to replicate the bug on Linux, although it is more complicated, because it involves installing new locales. If desired, I can post those instructions as well.

=======================

I believe the problem is in the main/grep.c file, in the fgrep_one
function. It tests for a multi-byte character string locale
`mbcslocale`, and then for the `use_UTF8`, like so:

    if (!useBytes && mbcslocale) {
        ...
    } else if (!useBytes && use_UTF8) {
        ...
    } else ...

This can be seen at
https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L668-L692

A similar pattern occurs in the fgrep_one_bytes function, at
https://github.com/wch/r-source/blob/e92b4c1cba05762480cd3898335144e5dd111cb7/src/main/grep.c#L718-L736


I believe that the test order should be reversed; it should test first
for `use_UTF8`, and then for `mbcslocale`. This pattern occurs in a few
places in grep.c. It looks like this:

    if (!useBytes && use_UTF8) {
        ...
    } else if (!useBytes && mbcslocale) {
        ...
    } else ...


=======================

This patch does what I described; it simply tests for `use_UTF8` first,
and then `mbcslocale`, in both fgrep_one and fgrep_one_bytes. I made
this patch against the 3.1.2 sources, and tested the example code above.
In both cases, grep() returned integer(0), as expected.

(The reason I made this change against 3.1.2 is because I had problems
getting the current trunk to compile on both Linux or Windows.)


diff --git src/main/grep.c src/main/grep.c
index 6e6ec3e..348c63d 100644
--- src/main/grep.c
+++ src/main/grep.c
@@ -664,27 +664,27 @@ static int fgrep_one(const char *pat, const char *target,
 	    }
 	return -1;
     }
-    if (!useBytes && mbcslocale) { /* skip along by chars */
-	mbstate_t mb_st;
+    if (!useBytes && use_UTF8) {
 	int ib, used;
-	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) {
 		if (next != NULL) *next = ib + plen;
 		return i;
 	    }
-	    used = (int) Mbrtowc(NULL,  target+ib, MB_CUR_MAX, &mb_st);
+	    used = utf8clen(target[ib]);
 	    if (used <= 0) break;
 	    ib += used;
 	}
-    } else if (!useBytes && use_UTF8) {
+    } else if (!useBytes && mbcslocale) { /* skip along by chars */
+	mbstate_t mb_st;
 	int ib, used;
+	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) {
 		if (next != NULL) *next = ib + plen;
 		return i;
 	    }
-	    used = utf8clen(target[ib]);
+	    used = (int) Mbrtowc(NULL,  target+ib, MB_CUR_MAX, &mb_st);
 	    if (used <= 0) break;
 	    ib += used;
 	}
@@ -714,21 +714,21 @@ static int fgrep_one_bytes(const char *pat, const char *target, int len,
 	    if (*p == pat[0]) return i;
 	return -1;
     }
-    if (!useBytes && mbcslocale) { /* skip along by chars */
-	mbstate_t mb_st;
+    if (!useBytes && use_UTF8) { /* not really needed */
 	int ib, used;
-	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) return ib;
-	    used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
+	    used = utf8clen(target[ib]);
 	    if (used <= 0) break;
 	    ib += used;
 	}
-    } else if (!useBytes && use_UTF8) { /* not really needed */
+    } else if (!useBytes && mbcslocale) { /* skip along by chars */
+	mbstate_t mb_st;
 	int ib, used;
+	mbs_init(&mb_st);
 	for (ib = 0, i = 0; ib <= len-plen; i++) {
 	    if (strncmp(pat, target+ib, plen) == 0) return ib;
-	    used = utf8clen(target[ib]);
+	    used = (int) Mbrtowc(NULL, target+ib, MB_CUR_MAX, &mb_st);
 	    if (used <= 0) break;
 	    ib += used;
 	}
Comment 1 Duncan Murdoch 2015-09-19 16:43:01 UTC
I've now committed your patch to R-devel.  If it tests out okay, I'll backport to R-patched.

BTW, it is a lot easier for me to handle a patch if it is produced by svn diff, rather than by git.  This is one reason for the delay in handling this bug report.
Comment 2 Duncan Murdoch 2015-09-22 09:10:37 UTC
Now fixed in R-patched as well as R-devel.