Bug 14958 - Package using UTF-8 is broken when built in ISO-8859-1 locale
Status: CLOSED FIXED
Product: R
Classification: Unclassified
Component: Misc
Version: R 2.15.0 patched
Hardware: x86_64/x64/amd64 (64-bit) Linux
Importance: P5 normal
Assigned To: R-core
 
Reported: 2012-06-25 13:09 UTC by Mikko Korpela
Modified: 2012-06-28 15:45 UTC (History)

Attachments
Sources of package that triggers the bug (743 bytes, application/x-gzip)
2012-06-25 13:09 UTC, Mikko Korpela
Tarball built in ISO-8859-1 locale (811 bytes, application/x-gzip)
2012-06-25 13:09 UTC, Mikko Korpela
Tarball built in UTF-8 locale (801 bytes, application/x-gzip)
2012-06-25 13:10 UTC, Mikko Korpela
Diff of extracted ISO and UTF tarballs (265 bytes, text/plain)
2012-06-25 13:11 UTC, Mikko Korpela
Hex dump of diff.txt (1.31 KB, text/plain)
2012-06-25 13:11 UTC, Mikko Korpela
Check log, tarball built in ISO-8859-1 locale (2.78 KB, text/x-log)
2012-06-25 13:11 UTC, Mikko Korpela
Check log, tarball built in UTF-8 locale (2.26 KB, text/x-log)
2012-06-25 13:12 UTC, Mikko Korpela
DESCRIPTION file from ISO-8859-1 package (462 bytes, text/plain)
2012-06-25 13:13 UTC, Mikko Korpela
DESCRIPTION file from UTF-8 package (465 bytes, text/plain)
2012-06-25 13:13 UTC, Mikko Korpela
Console log of reproducing the bug (6.68 KB, text/x-log)
2012-06-25 13:14 UTC, Mikko Korpela

Description Mikko Korpela 2012-06-25 13:09:10 UTC
Created attachment 1320
Sources of package that triggers the bug

Overview

When a package using the UTF-8 encoding and non-ASCII characters in
the Authors@R field of its DESCRIPTION file is built in an ISO-8859-1
locale, the resulting tarball (.tar.gz) fails R CMD check --as-cran.

Note: This is about R 2.15.1 patched. It was not available as an option in the Version selector.

List of Attached Files

1. encTest.tar.gz           Sources of package that triggers the bug
2. encTest_1.0_ISO.tar.gz   Tarball built in ISO-8859-1 locale
3. encTest_1.0_UTF.tar.gz   Tarball built in UTF-8 locale
4. diff.txt                 Diff of extracted ISO and UTF tarballs
5. diff_hexdump.txt         Hex dump of diff.txt
6. 00check_ISO.log          Check log, tarball built in ISO-8859-1 locale
7. 00check_UTF.log          Check log, tarball built in UTF-8 locale
8. DESCRIPTION_ISO          DESCRIPTION file from ISO-8859-1 package
9. DESCRIPTION_UTF          DESCRIPTION file from UTF-8 package
A. encTest-console.log      Console log of reproducing the bug

Files DESCRIPTION_ISO and diff.txt can be opened in ISO-8859-1 mode
(they are not valid UTF-8).  Other text files can be opened in UTF-8
mode, but only ASCII characters are used in diff_hexdump.txt.
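
Whether a given file is valid UTF-8 can be checked from a shell before opening it. The following is a minimal sketch, assuming the iconv utility is installed. The function name is_valid_utf8 and the two throwaway files are illustrative only and merely stand in for DESCRIPTION_UTF and DESCRIPTION_ISO; the sample word is made up.

```shell
# Succeeds (exit status 0) if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets a byte sequence that is
# invalid in the source encoding.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Two throwaway files containing the same made-up word.
# The octal escapes spell a-umlaut: \303\244 in UTF-8, \344 in ISO-8859-1.
tmpdir=$(mktemp -d)
printf 'Author: K\303\244ll\n' > "$tmpdir/utf.txt"
printf 'Author: K\344ll\n'     > "$tmpdir/iso.txt"

for f in "$tmpdir/utf.txt" "$tmpdir/iso.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not valid UTF-8"
    fi
done

rm -rf "$tmpdir"
```

Run against the attachments, the same function would be called as is_valid_utf8 DESCRIPTION_ISO and so on.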

If you want to "R CMD check" or "R CMD INSTALL" one of the attached tarballs (files 2 and 3 in the list), the "_ISO" or "_UTF" substring must be removed from the file name.
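
The renaming can be scripted. This is only a sketch; the helper name strip_locale_tag is hypothetical and not part of R or of the attachments.

```shell
# Map an attachment name to the package_version.tar.gz name
# by dropping the "_ISO" or "_UTF" tag.
strip_locale_tag() {
    case "$1" in
        *_ISO.tar.gz) printf '%s\n' "${1%_ISO.tar.gz}.tar.gz" ;;
        *_UTF.tar.gz) printf '%s\n' "${1%_UTF.tar.gz}.tar.gz" ;;
        *)            printf '%s\n' "$1" ;;
    esac
}

strip_locale_tag encTest_1.0_ISO.tar.gz   # prints encTest_1.0.tar.gz
```

Both attachments map to the same name, so it is easiest to copy each one into its own directory, for example with cp encTest_1.0_ISO.tar.gz "iso/$(strip_locale_tag encTest_1.0_ISO.tar.gz)".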

Steps to Reproduce

Notes:
* Steps 2-4 are optional if you only want to see the actual results
that don't match the expected results. For a comparison where
everything works as expected, also perform steps 2-4.
* See attached log file encTest-console.log (uses UTF-8 encoding)

1. Create a package that declares "Encoding: UTF-8" in
DESCRIPTION. Use some non-ASCII, but valid UTF-8 characters in the
Authors@R field.
1.b You can find such a package in the attached file
encTest.tar.gz. In this example, I use my complete name, non-ASCII
characters included, in Authors@R. Extract the contents of the
archive. On Unix-like systems with GNU tar, you can use the command
  tar xzf encTest.tar.gz

2. Build the package in a UTF-8 locale. Use "R CMD build encTest",
where encTest is the package directory created by extracting
encTest.tar.gz in step 1.

For information on setting the locale, see section "7.1 Locales" of
the "R Installation and Administration manual". On the Linux
installation I am using, the default locale uses UTF-8. The output of
command "locale" is included in the log file encTest-console.log.

3. Check the source package built in step 2 using R CMD check
--as-cran.

4. Make copies of the source package built in step 2 and the check log
produced by step 3 for later inspection.

5. Build the package in an ISO-8859-1 locale. See step 2. On the Linux
installation I am using (bash shell), LC_ALL=en_US.iso88591 in front
of "R CMD build" works for setting the locale temporarily, just for
that one command.

6. Check as in step 3.

7. Optionally make copies of the source package built in step 5 and
the check log produced by step 6. This is not necessary if the files
will not be overwritten by further builds / checks.
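
Steps 2-7 can be sketched as a small shell script. Several things here are assumptions rather than part of the report: the helper names run and build_and_check, the DRY_RUN switch, the locale name en_US.UTF-8 for the UTF-8 build, and the expectation that the check directory is named after the package directory (encTest.Rcheck). As written, the script only prints the commands it would run.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_and_check PKGDIR TARBALL LOCALE TAG
#   PKGDIR  - extracted package directory (encTest)
#   TARBALL - file that R CMD build produces (encTest_1.0.tar.gz)
#   LOCALE  - value of LC_ALL for the build command only
#   TAG     - suffix for the saved copies (UTF or ISO)
build_and_check() {
    pkgdir=$1
    tarball=$2
    build_locale=$3
    tag=$4
    run env LC_ALL="$build_locale" R CMD build "$pkgdir"    # steps 2 and 5
    run R CMD check --as-cran "$tarball"                     # steps 3 and 6
    run cp "$tarball" "${tarball%.tar.gz}_${tag}.tar.gz"     # steps 4 and 7
    run cp "${pkgdir}.Rcheck/00check.log" "00check_${tag}.log"
}

# Print the commands for both builds without executing anything.
DRY_RUN=1
build_and_check encTest encTest_1.0.tar.gz en_US.UTF-8 UTF
build_and_check encTest encTest_1.0.tar.gz en_US.iso88591 ISO
```

To actually perform the builds, remove the DRY_RUN=1 line. The locale names must match what "locale -a" lists on the system in question.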

Actual Results

Tarball built using a UTF-8 locale passes R CMD check --as-cran (step
3 above). The same check (step 6) finds errors when the package is
built using an ISO-8859-1 locale. It does not matter whether the
checks are performed in a UTF-8 or an ISO-8859-1 locale (the attached
log files only include output of checks done in a UTF-8 locale).

Looking at the tarball built in step 5, we find that the non-ASCII
characters originally present in the "Authors@R" field of the
DESCRIPTION file still have UTF-8 encoding, but the "Author" field
automatically generated in the build step incorrectly uses ISO-8859-1
encoding. The "Maintainer" field, also automatically generated, is
UTF-8 as it should be.
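
A mixed-encoding file like this can be located line by line from a shell. The sketch below assumes iconv is installed; the function name list_non_utf8_lines and the sample file are made up for illustration, and the sample only mimics the layout described above (it is not the attached DESCRIPTION).

```shell
# Print the number of every line that is not valid UTF-8.
list_non_utf8_lines() {
    n=0
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        # Each line is validated on its own, so one bad field
        # does not hide the state of the others.
        if ! printf '%s\n' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            echo "$n"
        fi
    done < "$1"
}

# Made-up sample: Authors@R and Maintainer in UTF-8 (\303\244),
# Author in ISO-8859-1 (\344).
sample=$(mktemp)
printf 'Authors@R: person("K\303\244ll")\n'           >  "$sample"
printf 'Author: K\344ll\n'                            >> "$sample"
printf 'Maintainer: K\303\244ll <k@example.org>\n'    >> "$sample"
printf 'Encoding: UTF-8\n'                            >> "$sample"

list_non_utf8_lines "$sample"
rm -f "$sample"
```

For the sample, only the line holding the Author field is reported. The same function could be pointed at the DESCRIPTION file extracted from a built tarball.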

Expected Results

The "Author" field of the tarball built using an ISO-8859-1 locale
should have UTF-8 encoding instead of ISO-8859-1 encoding, thus
respecting "Encoding: UTF-8".  Both tarballs should pass R CMD check
without errors.

Build Date & Platform

The relevant lines from R --version:

R version 2.15.1 Patched (2012-06-24 r59622) -- "Roasted Marshmallows"
Platform: x86_64-unknown-linux-gnu (64-bit)
Comment 1 Mikko Korpela 2012-06-25 13:09:56 UTC
Created attachment 1321
Tarball built in ISO-8859-1 locale
Comment 2 Mikko Korpela 2012-06-25 13:10:28 UTC
Created attachment 1322
Tarball built in UTF-8 locale
Comment 3 Mikko Korpela 2012-06-25 13:11:03 UTC
Created attachment 1323
Diff of extracted ISO and UTF tarballs
Comment 4 Mikko Korpela 2012-06-25 13:11:28 UTC
Created attachment 1324
Hex dump of diff.txt
Comment 5 Mikko Korpela 2012-06-25 13:11:53 UTC
Created attachment 1325
Check log, tarball built in ISO-8859-1 locale
Comment 6 Mikko Korpela 2012-06-25 13:12:54 UTC
Created attachment 1326
Check log, tarball built in UTF-8 locale
Comment 7 Mikko Korpela 2012-06-25 13:13:25 UTC
Created attachment 1327
DESCRIPTION file from ISO-8859-1 package
Comment 8 Mikko Korpela 2012-06-25 13:13:44 UTC
Created attachment 1328
DESCRIPTION file from UTF-8 package
Comment 9 Mikko Korpela 2012-06-25 13:14:31 UTC
Created attachment 1329
Console log of reproducing the bug
Comment 10 Brian Ripley 2012-06-28 10:10:23 UTC
The real solution is 'do not do that'.

No one guarantees that R CMD build is able to build a package in a locale other than one using the specified encoding (and in many cases it cannot).

A workaround for the specific case (UTF-8 with characters in the current locale) has been added to R-devel.
Comment 11 Mikko Korpela 2012-06-28 15:45:26 UTC
Thanks for working on this issue. I tested the workaround and found it effective at least for this particular case where the non-ASCII characters can be represented in both UTF-8 and ISO-8859-1 character sets. Encoding issues sure can be tricky.

I saw the edits made to "Writing R Extensions" where it is now recommended, among other things, that non-ASCII Author and Maintainer fields be included in DESCRIPTION (not auto-generated). Good to know.
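
A rough pre-build check along those lines could look like the sketch below. It is a coarse heuristic of my own, not something R provides: the function name check_explicit_author_fields is hypothetical, and it flags any DESCRIPTION that contains non-ASCII bytes anywhere but has no explicit Author or Maintainer line.

```shell
# Exit status 0 if DESCRIPTION is pure ASCII or has explicit
# Author and Maintainer fields; 1 otherwise.
check_explicit_author_fields() {
    desc=$1
    # Count bytes outside the ASCII range (tr works on bytes in the C locale).
    non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$desc" | wc -c | tr -d ' ')
    if [ "$non_ascii" -eq 0 ]; then
        return 0
    fi
    for field in Author Maintainer; do
        # "Authors@R:" does not match "^Author:", so it is not counted.
        if ! grep -q "^$field:" "$desc"; then
            echo "non-ASCII DESCRIPTION lacks an explicit $field field"
            return 1
        fi
    done
    return 0
}

# Made-up DESCRIPTION with a non-ASCII Authors@R field only.
demo=$(mktemp)
printf 'Package: demo\nAuthors@R: person("K\303\244ll")\nEncoding: UTF-8\n' > "$demo"
check_explicit_author_fields "$demo" || echo "add Author and Maintainer before building"
rm -f "$demo"
```

In a real package the call would be check_explicit_author_fields encTest/DESCRIPTION, run before R CMD build.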