Bug 17431 - Suggestion that trimws() should remove non-breaking white space
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist
Version: R 3.5.0
Hardware: Other Linux
Importance: P5 enhancement
Assignee: R-core
Reported: 2018-06-01 09:01 UTC by David Sterratt
Modified: 2018-06-02 20:59 UTC
Description David Sterratt 2018-06-01 09:01:51 UTC
At present the regexp used to match white space in trimws() is "[ \t\r\n]+". I have just spent a confused 30 minutes debugging a situation where a string had a trailing non-breaking space (Unicode character U+00A0). I therefore suggest amending the whitespace regexp in trimws() to: "[ \t\r\n\u00a0]+".
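For illustration, the behaviour described above can be reproduced with a minimal sketch that applies the two character classes directly through `sub()`. The input string `x` is a made-up example, not data from the report, and anchoring the pattern with `$` to mimic right-trimming is an assumption about how the regexp is applied.

```r
# Hypothetical input: a value followed by a no-break space (U+00A0).
x <- "value\u00a0"

# The character class quoted in the report does not cover U+00A0,
# so the trailing character survives.
old <- sub("[ \t\r\n]+$", "", x, perl = TRUE)

# The character class suggested in the report removes it.
new <- sub("[ \t\r\n\u00a0]+$", "", x, perl = TRUE)

nchar(old)  # the no-break space is still counted
nchar(new)
```

Printed to a console, `old` and `new` look the same, which is probably what makes this kind of problem slow to diagnose.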
Comment 1 Martin Maechler 2018-06-01 09:59:07 UTC
As `trimws()` uses `sub(..., perl=TRUE)`, we could consider much more, e.g., using `"\s"`, see

https://perldoc.perl.org/perlrecharclass.html#Whitespace

However, the help page ?trimws contains


 Details:

     For portability, ‘whitespace’ is taken as the character class
     ‘[ \t\r\n]’ (space, horizontal tab, line feed, carriage return).

so the idea was that "white space" should not depend on the locale etc.

I propose we add an argument `whitespace = "[ \t\r\n]"` to the function, so that we remain backward compatible while allowing more extended definitions of whitespace.
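A rough sketch of what such an argument could look like follows. The function name `trimws2` and its internals are hypothetical and written only for illustration; they are not the code that was committed.

```r
# Hypothetical variant of trimws() with a user-supplied character class.
# 'whitespace' is a regexp matching one white space character; the default
# keeps the documented portable behaviour.
trimws2 <- function(x, which = c("both", "left", "right"),
                    whitespace = "[ \t\r\n]") {
  which <- match.arg(which)
  strip <- function(re, x) sub(re, "", x, perl = TRUE)
  left  <- paste0("^", whitespace, "+")   # leading run
  right <- paste0(whitespace, "+$")       # trailing run
  switch(which,
         left  = strip(left, x),
         right = strip(right, x),
         both  = strip(right, strip(left, x)))
}

# Default class: the no-break space is left in place.
a <- trimws2("  a\u00a0")
# Extended class: the no-break space is removed as well.
b <- trimws2("  a\u00a0", whitespace = "[ \t\r\n\u00a0]")
```

Keeping the old class as the default means existing callers see no change, while callers with Unicode input can opt in to a wider definition.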
Comment 2 David Sterratt 2018-06-01 12:31:54 UTC
Yes, that sounds like a good idea. It would have saved me having to redefine the function. Perhaps the docs could remind users that there are characters that might appear as white space, e.g. \u00a0.
Comment 3 Martin Maechler 2018-06-02 20:59:03 UTC
Kurt Hornik mentioned how to specify all horizontal and vertical white space; an example was added as well.

Now committed to R-devel, svn c 74838
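The comment does not quote the expression itself. One way to cover all horizontal and vertical white space with a Perl-compatible regexp is the class `"[\\h\\v]"`, which is assumed here to be what is meant. The sketch below uses `sub()` directly so that it does not depend on the R version; the input string is invented.

```r
# Hypothetical input with mixed white space at both ends, including a
# no-break space (U+00A0) and a vertical tab.
x <- " \t\u00a0text\u00a0\v\n"

# PCRE classes: \h is horizontal white space, \v is vertical white space.
ws <- "[\\h\\v]"

# Strip the leading run first, then the trailing run.
trimmed <- sub(paste0(ws, "+$"), "",
               sub(paste0("^", ws, "+"), "", x, perl = TRUE),
               perl = TRUE)
```

In an R version that includes this change, the same class could presumably be passed as `trimws(x, whitespace = "[\\h\\v]")`.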