Bug 16855 - Input ignored after the first single-byte character on systems where sizeof(wchar_t) is 2 with grep, [g]sub, [g]regexpr
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Low-level
Version: R 3.2.4 revised
Hardware: All Other
Importance: P5 minor
Assignee: R-core
Reported: 2016-04-26 08:06 UTC by erw7.github
Modified: 2016-04-26 08:06 UTC
Description erw7.github 2016-04-26 08:06:22 UTC
This bug occurs with grep(), [g]sub(), and [g]regexpr() when all of the following hold:
* R was compiled on a system where sizeof(wchar_t) is 2 (e.g. Cygwin).
* R is running in a locale with a multibyte encoding (e.g. UTF-8).
* The search pattern or the text contains a single-byte character.

Here's an example:
# First, create three 3-byte UTF-8 characters
a <- rawToChar(as.raw(c(0xe3, 0x81, 0x82)))
Encoding(a) <- "UTF-8"
a
# [1] "あ"
i <- rawToChar(as.raw(c(0xe3, 0x81, 0x84)))
Encoding(i) <- "UTF-8"
i
# [1] "い"
u <- rawToChar(as.raw(c(0xe3, 0x81, 0x86)))
Encoding(u) <- "UTF-8"
u
# [1] "う"
grep(paste(a, "a", i, sep = ""), paste(a, "a", u, sep = ""))
# [1] 1
# The correct result is integer(0)
grep(i, paste(a, "a", i))
# integer(0)
# The correct result is [1] 1
=============
I believe the problem is the definition of static const char TO_WCHAR[] in src/main/sysutils.c. Even though sizeof(wchar_t) is 2, TO_WCHAR is defined as UCS-4LE.
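
To illustrate the suspected mismatch, here is a minimal C sketch. It does not use R's code or iconv. It hard-codes the UCS-4LE bytes of "あaい" and reads them as 16-bit units, the way a 2-byte wchar_t would view the buffer. The names UCS4LE_SAMPLE, unit16_le and visible_units16 are hypothetical and exist only for this illustration. R's actual conversion path may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* "あaい" (U+3042, U+0061, U+3044) encoded as UCS-4LE,
   followed by a 4-byte terminator. */
const unsigned char UCS4LE_SAMPLE[] = {
    0x42, 0x30, 0x00, 0x00,   /* U+3042 */
    0x61, 0x00, 0x00, 0x00,   /* U+0061 */
    0x44, 0x30, 0x00, 0x00,   /* U+3044 */
    0x00, 0x00, 0x00, 0x00    /* terminator */
};

/* Read the idx-th little-endian 16-bit unit from a byte buffer,
   mimicking how a 2-byte wchar_t would see the memory. */
uint16_t unit16_le(const unsigned char *buf, size_t idx)
{
    return (uint16_t)(buf[2 * idx] | (buf[2 * idx + 1] << 8));
}

/* Count 16-bit units before the first zero unit, as a wcslen()-style
   scan would on a platform where sizeof(wchar_t) == 2. */
size_t visible_units16(const unsigned char *buf, size_t nbytes)
{
    size_t n = 0;
    while (2 * n + 1 < nbytes && unit16_le(buf, n) != 0)
        n++;
    return n;
}
```

Calling visible_units16(UCS4LE_SAMPLE, sizeof UCS4LE_SAMPLE) stops at the zero upper half of the first UCS-4 code unit, so only U+3042 is visible. That would be consistent with the example above, where everything after the first character appears to be ignored.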
=============
The following patch defines TO_WCHAR as UCS-2LE or UCS-2BE when sizeof(wchar_t) is 2.

--- m4/R.m4.orig        2016-04-24 16:49:34.626084100 +0900
+++ m4/R.m4     2016-04-26 14:49:53.374930500 +0900
@@ -3506,6 +3506,9 @@
     want_mbcs_support=no
   fi
 fi
+if test "$want_mbcs_support" = yes; then
+  AC_CHECK_SIZEOF(wchar_t, [], [#include <wchar.h>])
+fi
 if test "x${want_mbcs_support}" != xyes; then
 AC_MSG_ERROR([Support for MBCS locales is required.])
 fi

--- src/main/sysutils.c.orig    2016-04-24 17:56:48.062015600 +0900
+++ src/main/sysutils.c 2016-04-26 15:07:37.859267000 +0900
@@ -1009,13 +1009,17 @@
 }


-#ifdef Win32
+#if defined(Win32) || (SIZEOF_WCHAR_T == 2 && !defined(WORDS_BIGENDIAN))
 static const char TO_WCHAR[] = "UCS-2LE";
 #else
-# ifdef WORDS_BIGENDIAN
-static const char TO_WCHAR[] = "UCS-4BE";
+# if SIZEOF_WCHAR_T == 2
+static const char TO_WCHAR[] = "UCS-2BE";
 # else
+#  ifdef WORDS_BIGENDIAN
+static const char TO_WCHAR[] = "UCS-4BE";
+#  else
 static const char TO_WCHAR[] = "UCS-4LE";
+#  endif
 # endif
 #endif
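
The selection made by the preprocessor conditionals above can also be written as a run-time function, which may be handy for checking the mapping on a given platform. This is only a sketch: wchar_encoding_name and host_is_big_endian are hypothetical helpers, not part of R, and any width other than 2 is assumed to mean a 4-byte wchar_t.

```c
#include <stddef.h>

/* Hypothetical helper: choose the iconv encoding name that matches
   the width and byte order of wchar_t. Widths other than 2 are
   assumed to be 4 bytes. */
const char *wchar_encoding_name(size_t wchar_size, int big_endian)
{
    if (wchar_size == 2)
        return big_endian ? "UCS-2BE" : "UCS-2LE";
    return big_endian ? "UCS-4BE" : "UCS-4LE";
}

/* Hypothetical helper: detect the host byte order at run time by
   inspecting the first byte of an integer with value 1. */
int host_is_big_endian(void)
{
    const unsigned int probe = 1;
    return *(const unsigned char *)&probe == 0;
}
```

For example, wchar_encoding_name(sizeof(wchar_t), host_is_big_endian()) would be expected to yield "UCS-2LE" on a little-endian system with a 2-byte wchar_t such as Cygwin.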