Bugzilla – Bug 14958
Package using UTF-8 is broken when built in ISO-8859-1 locale
Last modified: 2012-06-28 15:45:26 UTC
Created attachment 1320
Sources of package that triggers the bug
When a package using the UTF-8 encoding and non-ASCII characters in
the Authors@R field of its DESCRIPTION file is built in an ISO-8859-1
locale, the resulting tarball (.tar.gz) fails R CMD check --as-cran.
Note: This is about R 2.15.1 patched. It was not available as an option in the Version selector.
List of Attached Files
1. encTest.tar.gz Sources of package that triggers the bug
2. encTest_1.0_ISO.tar.gz Tarball built in ISO-8859-1 locale
3. encTest_1.0_UTF.tar.gz Tarball built in UTF-8 locale
4. diff.txt Diff of extracted ISO and UTF tarballs
5. diff_hexdump.txt Hex dump of diff.txt
6. 00check_ISO.log Check log, tarball built in ISO-8859-1 locale
7. 00check_UTF.log Check log, tarball built in UTF-8 locale
8. DESCRIPTION_ISO DESCRIPTION file from ISO-8859-1 package
9. DESCRIPTION_UTF DESCRIPTION file from UTF-8 package
A. encTest-console.log Console log of reproducing the bug
Files DESCRIPTION_ISO and diff.txt can be opened in ISO-8859-1 mode
(they are not valid UTF-8). Other text files can be opened in UTF-8
mode, but only ASCII characters are used in diff_hexdump.txt.
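To illustrate why the ISO files are not valid UTF-8, here is a minimal Python sketch that encodes one character both ways. The character "é" is only an assumed example and is not taken from the attached files; the helper name is_valid_utf8 is likewise made up for this illustration.

```python
# Illustrative only: "é" is an assumed sample character, not from the attachments.
sample = "é"

utf8_bytes = sample.encode("utf-8")         # two bytes: 0xC3 0xA9
latin1_bytes = sample.encode("iso-8859-1")  # one byte: 0xE9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))      # c3a9 True
print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))  # e9 False
```

A lone 0xE9 byte starts a multi-byte UTF-8 sequence that never completes, which is why a UTF-8 editor rejects such a file while an ISO-8859-1 editor shows the character correctly.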
If you want to "R CMD check" or "R CMD INSTALL" one of the attached tarballs (files 2 and 3 in the list), the "_ISO" or "_UTF" substring must be removed from the file name.
Steps to Reproduce
* Steps 2-4 are optional if you only want to see the actual results
that don't match the expected results. For a comparison where
everything works as expected, also perform steps 2-4.
* See attached log file encTest-console.log (uses UTF-8 encoding)
1. Create a package that declares "Encoding: UTF-8" in
DESCRIPTION. Use some non-ASCII, but valid UTF-8 characters in the
Authors@R field.
1.b You can find such a package in the attached file
encTest.tar.gz. In this example, I use my complete name, non-ASCII
characters included, in Authors@R. Extract the contents of the
archive. On Unix-like systems with GNU tar, you can use the command
tar xzf encTest.tar.gz
2. Build the package in a UTF-8 locale. Use "R CMD build encTest",
where encTest is the package directory created by extracting
encTest.tar.gz in step 1.
For information on setting the locale, see section "7.1 Locales" of
the "R Installation and Administration manual". On the Linux
installation I am using, the default locale uses UTF-8. The output of
command "locale" is included in the log file encTest-console.log.
3. Check the source package built in step 2 using R CMD check --as-cran.
4. Make copies of the source package built in step 2 and the check log
produced by step 3 for later inspection.
5. Build the package in an ISO-8859-1 locale. See step 2. On the Linux
installation I am using (bash shell), LC_ALL=en_US.iso88591 in front
of "R CMD build" works for setting the locale temporarily, just for
that one command.
6. Check as in step 3.
7. Optionally make copies of the source package built in step 5 and
the check log produced by step 6. This is not necessary if the files
will not be overwritten by further builds / checks.
Actual Results
Tarball built using a UTF-8 locale passes R CMD check --as-cran (step
3 above). The same check (step 6) finds errors when the package is
built using an ISO-8859-1 locale. It does not matter whether the
checks are performed in a UTF-8 or an ISO-8859-1 locale (the attached
log files only include output of checks done in a UTF-8 locale).
Looking at the tarball built in step 5, we find that the non-ASCII
characters originally present in the "Authors@R" field of the
DESCRIPTION file still have UTF-8 encoding, but the "Author" field
automatically generated in the build step incorrectly uses ISO-8859-1
encoding. The "Maintainer" field, also automatically generated, is
UTF-8 as it should be.
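The mixed-encoding symptom described above can be checked mechanically. The following Python sketch flags DESCRIPTION fields whose bytes are not valid UTF-8. The function name non_utf8_fields, the sample name "José", and the sample DESCRIPTION content are all assumptions made for illustration; they are not taken from the attached files.

```python
def non_utf8_fields(description: bytes) -> list:
    """Return names of DESCRIPTION fields whose lines are not valid UTF-8."""
    bad = []
    current = None  # field that continuation lines belong to
    for raw in description.splitlines():
        if raw[:1] not in (b" ", b"\t") and b":" in raw:
            # A new "Field: value" line; field names are ASCII.
            current = raw.split(b":", 1)[0].decode("ascii", errors="replace")
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            if current is not None and current not in bad:
                bad.append(current)
    return bad


# Hypothetical DESCRIPTION content mimicking the reported symptom.
name = "José"  # assumed example name
mixed = b"\n".join([
    b"Package: encTest",
    b"Encoding: UTF-8",
    b'Authors@R: person("' + name.encode("utf-8") + b'", role = "aut")',
    b"Author: " + name.encode("iso-8859-1"),
    b"Maintainer: " + name.encode("utf-8") + b" <user@example.org>",
])
print(non_utf8_fields(mixed))  # ['Author']
```

To try this on a real file, read it in binary mode (for example with open(path, "rb")) so that no decoding happens before the check.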
Expected Results
The "Author" field of the tarball built using an ISO-8859-1 locale
should have UTF-8 encoding instead of ISO-8859-1 encoding, thus
respecting "Encoding: UTF-8". Both tarballs should pass R CMD check.
Build Date & Platform
The relevant lines from R --version:
R version 2.15.1 Patched (2012-06-24 r59622) -- "Roasted Marshmallows"
Platform: x86_64-unknown-linux-gnu (64-bit)
Created attachment 1321
Tarball built in ISO-8859-1 locale
Created attachment 1322
Tarball built in UTF-8 locale
Created attachment 1323
Diff of extracted ISO and UTF tarballs
Created attachment 1324
Hex dump of diff.txt
Created attachment 1325
Check log, tarball built in ISO-8859-1 locale
Created attachment 1326
Check log, tarball built in UTF-8 locale
Created attachment 1327
DESCRIPTION file from ISO-8859-1 package
Created attachment 1328
DESCRIPTION file from UTF-8 package
Created attachment 1329
Console log of reproducing the bug
The real solution is 'do not do that'.
No one guarantees that R CMD build can build a package in a locale other than one using the specified encoding (and in many cases it cannot).
A workaround for the specific case (UTF-8 with characters in the current locale) has been added to R-devel.
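As a rough illustration of what "characters in the current locale" means for an ISO-8859-1 locale, the Python sketch below tests whether a string can be represented in a given encoding. This is not the R-devel workaround itself, whose implementation is not shown in this report; the helper name representable_in and both sample characters are assumptions.

```python
def representable_in(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Assumed sample characters: "é" exists in ISO-8859-1, "ł" does not.
print(representable_in("é", "iso-8859-1"))  # True
print(representable_in("ł", "iso-8859-1"))  # False
```

A name containing a character like the second one has no ISO-8859-1 form at all, so it falls outside the specific case quoted above.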
Thanks for working on this issue. I tested the workaround and found it effective at least for this particular case where the non-ASCII characters can be represented in both UTF-8 and ISO-8859-1 character sets. Encoding issues sure can be tricky.
I saw the edits made to "Writing R Extensions" where it is now recommended, among other things, that non-ASCII Author and Maintainer fields be included in DESCRIPTION (not auto-generated). Good to know.