Bug 16377 - Strange behaviour for Polish special character "ł"
Alias: None
Product: R
Classification: Unclassified
Component: Windows GUI / Window specific
Version: R 3.1.3
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 minor
Assignee: R-core
Depends on:
Reported: 2015-05-11 10:20 UTC by Peter Meissner
Modified: 2015-05-11 12:38 UTC
Description Peter Meissner 2015-05-11 10:20:39 UTC
On a Windows machine with R 3.1.3 as well as 3.2.0 the following behavior is quite surprising and most likely a bug:

## character(0)

## [1] "ł"

## Error: object 'l' not found

ł <- 357576
## [1] "l"

## [1] 1

## [1] 1

## [1] "l"

It seems that the character "ł" is internally translated (downgraded) to a plain "l" (lowercase L), which should not happen, right?

Beyond the fact that values get assigned to the wrong object ("l" instead of "ł"), the real problem is that this interferes with text processing in general. See e.g. https://github.com/petermeissner/wikipediatrend/issues/12
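
One way to tell whether the character itself is being changed, rather than only how it is displayed, is to compare it before and after a round trip through the session's native encoding. The following is a minimal sketch and not code from the original report; the helper name `survives_native` is hypothetical, and its result depends on the locale the session runs in.

```r
# "\u0142" is the escape for ł (LATIN SMALL LETTER L WITH STROKE, U+0142).
# Writing it as an escape keeps this script itself pure ASCII.
x <- "\u0142"

utf8ToInt(x)   # Unicode code point of the character: 322 (0x142)
charToRaw(x)   # bytes as stored; c5 82 when the string is held as UTF-8

# Hypothetical helper: for each element of `s`, TRUE if it is unchanged
# after being converted to the native encoding and back to UTF-8.
survives_native <- function(s) {
  s <- enc2utf8(s)
  round_trip <- enc2utf8(enc2native(s))
  round_trip == s
}

# ASCII "l" always survives; whether ł does depends on the session's locale.
survives_native(c("l", x))
```

A `FALSE` for the second element would indicate that the native encoding cannot hold "ł", which is the situation the report describes.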

# ------------------------------------------------------------------------


## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
## [4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## loaded via a namespace (and not attached):
## [1] tools_3.2.0

# ------------------------------------------------------------------------

The same code was also tested on Ubuntu 14.04 LTS (Linux), where everything works as expected.
Comment 1 Duncan Murdoch 2015-05-11 10:49:57 UTC
I haven't traced through the details, but the likely cause of this is that you are working in a code page that does not contain that character.  R can represent it in a UTF-8 string, but many things are converted to the local character set for processing.  

It's a long-standing problem that Windows doesn't support UTF-8 natively, though all but very old versions can work with it to some extent. However, R was written back when it couldn't, and so R does a lot of conversions to the native encoding and back.

If we were starting from scratch, I think it would make more sense to use UTF-8 internally for all strings, regardless of the native encoding.  However, changing to that now would be a huge amount of work and quite disruptive.

So all I can advise is that you should use a different code page (maybe 1250?) if you're working with both German and Polish characters.  Or use Linux.
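
The code-page explanation above can be probed with `iconv()`. This is a sketch under stated assumptions, not a diagnosis of this particular bug: it assumes the iconv implementation in use accepts the names "CP1252" and "CP1250", and how unconvertible input is handled (reported as `NA`, or replaced by a look-alike) can differ between platforms.

```r
# Character set of the current session. On Windows the LC_CTYPE value
# ends in the code page, e.g. "German_Germany.1252" as in the report.
Sys.getlocale("LC_CTYPE")

x <- "\u0142"  # ł, U+0142

# Try converting to the Western European (1252) and Central European (1250)
# code pages. With the default sub = NA, iconv() is documented to return NA
# for elements it cannot convert.
to_1252 <- iconv(x, from = "UTF-8", to = "CP1252")
to_1250 <- iconv(x, from = "UTF-8", to = "CP1250")

is.na(to_1252)      # TRUE would mean code page 1252 has no slot for ł
charToRaw(to_1250)  # the single byte that represents ł in code page 1250

# Converting back from 1250 should recover the original character.
back <- iconv(to_1250, from = "CP1250", to = "UTF-8")
back == x
```

If "ł" cannot be converted to the session's code page but can be converted to 1250, that is consistent with the suggestion to switch code pages when working with Polish text.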
Comment 2 Peter Meissner 2015-05-11 12:38:06 UTC
I suspect that all R users on Windows who work with text would welcome the disruption, though I understand that this is no easy step.

Thanks for the reply, clarification and suggestion.