Bug 15884 - scan() failed to separate Chinese character string with spaces
Summary: scan() failed to separate Chinese character string with spaces
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: Windows GUI / Window specific
Version: R 3.1.0
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 major
Assignee: R-core
Depends on:
Reported: 2014-07-16 04:31 UTC by y-li-12
Modified: 2014-07-16 04:31 UTC

Description y-li-12 2014-07-16 04:31:13 UTC
I'm trying to read a string into a character vector with scan(). The string consists of Chinese characters separated by spaces, but the latest versions of R on Windows seem to have a bug.

For instance, the input is:

> scan(text="R语言 是 一门 统计 专用 语言",what="character",encoding="UTF-8")

which should be split at its 5 spaces into 6 words, but the output is:

  Read 4 items
  [1] "R语言"        "是 一门 统计" "专用"         "语言"    

I found this bug on R 3.1.0 and R 3.1.1 (both the 32-bit and 64-bit builds) on 64-bit Windows 7.

With R 3.0.3 (32-bit and 64-bit) on Windows 7, or R 3.1.0 (64-bit) on Ubuntu 14.04, the same call works correctly and returns 6 words.
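
Since the result differs between platforms and versions, it may help to record the build and locale of each session when comparing them. The snippet below is only a sketch of what could be collected; it is not a diagnosis, and the assumption that the session's locale or native encoding is relevant here is unconfirmed.

```r
# Report the R build and platform of the current session.
cat(R.version.string, "\n")
cat(R.version$platform, "\n")

# Report the character-type locale in use.
cat(Sys.getlocale("LC_CTYPE"), "\n")

# l10n_info() says whether the session runs in a multibyte and/or UTF-8 locale
# (on Windows it also reports the code page).
info <- l10n_info()
str(info)
```

Running this in both a failing and a working session gives a side-by-side view of how the two environments differ.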

For strings entirely in English, scan() also works correctly.

I worked around the problem with strsplit(), but could someone look into this issue and fix it? Thanks!
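
For reference, a minimal sketch of the strsplit() workaround. It assumes the words are separated by exactly one space each and that the script is read as UTF-8. The variable names `x` and `words` are only illustrative.

```r
# The string from the example above.
x <- "R语言 是 一门 统计 专用 语言"

# Split on a literal single space (fixed = TRUE disables regex matching).
# strsplit() returns a list with one element per input string, so take the first.
words <- strsplit(x, " ", fixed = TRUE)[[1]]

print(length(words))  # 6 pieces, one per word
print(words)
```

Unlike scan(), this does not collapse runs of whitespace or handle quoting, so it is a stand-in only for simple single-space input like the string above.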