Bug 17128 - as.name() produces unexpected results for strings in non-native encoding
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Low-level
Version: R-devel (trunk)
Hardware: Other All
Importance: P5 normal
Assignee: R-core
 
Reported: 2016-08-09 22:48 UTC by Kirill Müller
Modified: 2017-03-06 10:46 UTC

Description Kirill Müller 2016-08-09 22:48:21 UTC
as.name() doesn't seem to respect the declared encoding of a character argument. This is with R-devel r71070 on Ubuntu:

> a <- "ä"
> Encoding(a)
[1] "UTF-8"
> as.name(a)
`ä`
> Encoding(as.character(as.name(a)))
[1] "UTF-8"
> ai <- iconv(a, to = "latin1")
> Encoding(ai)
[1] "latin1"
> as.name(ai)
ä
> Encoding(as.character(as.name(ai)))
[1] "unknown"

A similar result with R 3.3.0-patched on Windows:

> a <- "ä"
> Encoding(a)
[1] "latin1"
> as.name(a)
ä
> Encoding(as.character(as.name(a)))
[1] "latin1"
> ai <- enc2utf8(a)
> Encoding(ai)
[1] "UTF-8"
> as.name(ai)
`ä`
> Encoding(as.character(as.name(ai)))
[1] "unknown"

I think as.name() should call enc2native() on the argument first. The call() function seems to do this already (results consistent on Linux and Windows):

> identical(call(a), call(ai))
[1] TRUE
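
The proposal above can be sketched at the R level as a thin wrapper. This is only an illustration: as_name_native() is a hypothetical helper, not a base R function, and it assumes the native encoding can represent the character (for example a UTF-8 or latin1 locale). The string is written with a \u escape so the script does not depend on the encoding of the source file.

```r
# Hypothetical helper (not part of base R): convert the string to the
# native encoding before creating the symbol, as proposed in the report.
as_name_native <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  as.name(enc2native(x))
}

a  <- "\u00e4"                                 # marked as UTF-8
ai <- iconv(a, from = "UTF-8", to = "latin1")  # same character, marked as latin1

# Assumption: the native encoding can represent the character. In that case
# both inputs are expected to map to the same symbol.
same_symbol <- identical(as_name_native(a), as_name_native(ai))
print(same_symbol)
```

In a locale that cannot represent the character (such as the C locale), enc2native() substitutes escape sequences, so the two symbols may still differ there.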
Comment 1 Martin Maechler 2016-08-11 07:24:17 UTC
This is not my main area of expertise, but I think your proposed change would incur a performance penalty for all uses of as.name() / as.symbol() and, for internal consistency, probably also for the underlying C code, which may be called even more often than the R-level as.name() / as.symbol().

Of course it can be that the performance penalty would almost always be so small as to be irrelevant.

OTOH, can you give examples where the current behavior is a true problem?
Comment 2 Kirill Müller 2016-09-07 18:24:33 UTC
It's not a true problem, just inconsistent. By the same token, as.character.call() and as.character.name() should probably return strings with a declared encoding (currently the returned strings are marked as "unknown").

A comment in the R sources hints that the current behavior of as.name() is intended:

https://github.com/wch/r-source/blob/cb97095ca0b4760440b684f804e6ae0460bdbd66/src/main/names.c#L1248-L1250

Please feel free to close.
Comment 3 Tomas Kalibera 2017-03-06 10:46:40 UTC
I've fixed this bug based on reports that it causes real problems for Windows/CJK users. The fix is in the C code called by as.name(): strings are converted to the native encoding before they are installed into the symbol table.
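
A quick way to see whether a given R build behaves as described in the fix is to compare the symbols created from the two encodings directly. This is a minimal sketch, not a test taken from the R sources; it assumes a native encoding that can represent the character, and the variable names are illustrative.

```r
a  <- "\u00e4"                                 # marked as UTF-8
ai <- iconv(a, from = "UTF-8", to = "latin1")  # same character, marked as latin1

sym_utf8   <- as.name(a)
sym_latin1 <- as.name(ai)

# With the conversion to native encoding in place, both strings are expected
# to install the same symbol, provided the locale can represent the character.
same_symbol <- identical(sym_utf8, sym_latin1)

cat("R version:", R.version.string, "\n")
cat("same symbol for both encodings:", same_symbol, "\n")
```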