Bug 15010 - Accented unicode (UTF-8) characters do not align correctly
Summary: Accented unicode (UTF-8) characters do not align correctly
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 2.15.1
Hardware: All Linux
Importance: P5 trivial
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2012-08-04 11:43 UTC by Cagri Coltekin
Modified: 2012-08-05 01:05 UTC

Attachments
A UTF-8 encoded file that contains problematic accents. (2.22 KB, application/octet-stream)
2012-08-04 11:43 UTC, Cagri Coltekin

Description Cagri Coltekin 2012-08-04 11:43:24 UTC
Created attachment 1356
A UTF-8 encoded file that contains problematic accents.

Some accented Unicode characters (IPA symbols) align incorrectly when, for example, displaying the contents of a data frame whose values include these accents.

I use R 2.15.1 from Debian testing/unstable. I can reproduce the problem on both standard uxterm and roxterm, using a few different monospaced fonts. The same table displays fine outside R.

It looks like the length of the accented strings is miscalculated. The problem can be reproduced by reading in the attached file: read.table('ipa.vowels').

My locale setting is as follows,

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Please let me know if you'd like to have more information.
Best regards,

Cagri Coltekin
Comment 1 Simon Urbanek 2012-08-04 18:04:42 UTC
The issue seems to be that your table contains non-spacing, combining Unicode characters, e.g., row 18 is "e\u031E" where U+031E is "combining down tack below". As far as R is concerned those are two separate characters:

> "e\u031e"
[1] "e̞"
> nchar("e\u031e")
[1] 2

but they don't print as such (reproducible on Linux and OS X in UTF-8 locales).
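
To make the "two separate characters" point concrete, here is a minimal sketch that lists the code points and counts in the example string. It only uses base R functions (`utf8ToInt`, `nchar`, `sprintf`); the variable names are illustrative.

```r
# "e" followed by U+031E (combining down tack below)
x <- "e\u031e"

# Code points stored in the string: one base letter plus one combining mark
code_points <- sprintf("U+%04X", utf8ToInt(x))
print(code_points)

# Number of characters (code points) versus number of UTF-8 bytes
n_chars <- nchar(x, type = "chars")
n_bytes <- nchar(x, type = "bytes")
print(c(chars = n_chars, bytes = n_bytes))
```

The string holds two code points in three bytes, yet a terminal draws it in a single column, which is where the character count and the on-screen width diverge.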
Comment 2 Cagri Coltekin 2012-08-04 19:29:01 UTC
Thanks for the quick response.

I do not fully understand the explanation, but I understand that,

> substr("e̞",1,1)
[1] "e"
> substr("e̞",1,2)
[1] "e̞"

So the diacritic counts as a separate character in a character string. That's fair.

But the following (a smaller example this time) strikes me as incorrectly formatted.

> data.frame(sym=c("eee", "ee\u031ea", "e\u031ee\u031ea", "e\u031ee\u031ee\u031e", "eee"))
  sym
1 eee
2  ee̞a
3   e̞e̞a
4    e̞e̞e̞
5 eee

I think this is something everyone can live with, but nobody would want the alignment to look like this.
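
One possible workaround, sketched here rather than proposed as a fix for R's printing code, is to pad the strings by hand before display, counting only the characters that occupy a column. The helpers `display_width` and `pad_left` are hypothetical names, not part of R. The sketch assumes that R's PCRE build supports Unicode properties (`\p{Mn}`, non-spacing marks) and that `strrep` is available (R 3.3.0 or later); it does not handle double-width CJK characters.

```r
# Hypothetical helper: count columns by dropping non-spacing marks (category Mn)
# before counting characters. Assumes PCRE with Unicode property support.
display_width <- function(x) {
  nchar(gsub("\\p{Mn}", "", x, perl = TRUE), type = "chars")
}

# Hypothetical helper: right-align strings on their visible width
pad_left <- function(x) {
  w <- display_width(x)
  paste0(strrep(" ", max(w) - w), x)
}

sym <- c("eee", "ee\u031ea", "e\u031ee\u031ea", "e\u031ee\u031ee\u031e", "eee")
widths <- display_width(sym)
padded <- pad_left(sym)
writeLines(padded)
```

All five strings have a visible width of three columns here, so no padding is added and the printed lines start in the same column.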

Best regards,
Cagri Coltekin
Comment 3 Brian Ripley 2012-08-05 01:05:54 UTC
On 04/08/2012 18:04, r-bugs@r-project.org wrote:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=15010
>
> Simon Urbanek <simon.urbanek@r-project.org> changed:
>
>             What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                   CC|                            |simon.urbanek@r-project.org
>             Platform|x86_64/x64/amd64 (64-bit)   |All
>
> --- Comment #1 from Simon Urbanek <simon.urbanek@r-project.org> 2012-08-04 13:04:42 EDT ---
> The issue seems to be that your table contains non-spacing, combining unicode
> characters, e.g., row 18 is "e\u031E" where U+031E is "combining down tack
> below' . As far as R is concerned those are two separate characters:
>
>> "e\u031e"
> [1] "e̞"
>> nchar("e\u031e")
> [1] 2
>
> but they don't print as such (reproducible on Linux and OS X in UTF-8 locales).


But I think the issue is more likely to be the code underlying

 > nchar("e\u031e", type="w")
[1] 0

Computing the widths of character strings is an OS service, one that is
notoriously unreliable. My memory is that Ei-ji Nakama replaced this on
some OSes (certainly on OS X). We may be able to do better now that e.g.
ICU is more widely available (but not on Windows).

But this really is an esoteric subject (except to people writing CJK
languages), and maybe the problem is expecting R to handle such things.
It needs someone interested and knowledgeable (such as Mr Nakama) to
contribute a well-tested patch.
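
To see what a given platform reports, the three `nchar` types can be compared side by side for the strings from the earlier data frame example. This is only a diagnostic sketch: the `chars` and `bytes` columns follow from the UTF-8 encoding, while the `width` column depends on the R version and platform, so no value is asserted for it here.

```r
sym <- c("eee", "ee\u031ea", "e\u031ee\u031ea", "e\u031ee\u031ee\u031e", "eee")

# One column per nchar() type; the "width" column is platform dependent
counts <- sapply(c("chars", "bytes", "width"),
                 function(type) nchar(sym, type = type))
print(counts)
```

If the `width` column differs from the number of columns the terminal actually uses (three for every string in this example), column alignment in printed output can be expected to drift in the way the report describes.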


-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595