At present the regexp used to match white space in trimws() is "[ \t\r\n]+". I have just spent a confused 30 minutes debugging a situation where there was a trailing non-breaking space (Unicode character U+00A0). I therefore suggest amending the whitespace regexp in trimws() to: "[ \t\r\n\u00a0]+".
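A minimal sketch of the behaviour described above, calling sub() directly rather than trimws() so it does not depend on the R version. The sample string and the variable names are made up for illustration, and a UTF-8 session is assumed.

```r
# Hypothetical input: a word followed by a non-breaking space (U+00A0).
x <- "value\u00a0"

# Current character class: the trailing U+00A0 is not matched, so x is unchanged.
plain <- sub("[ \t\r\n]+$", "", x, perl = TRUE)

# Class extended as suggested in the report: the trailing U+00A0 is removed.
extended <- sub("[ \t\r\n\u00a0]+$", "", x, perl = TRUE)

nchar(plain)     # still includes the non-breaking space
nchar(extended)  # one character shorter
```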
As `trimws()` uses `sub(..., perl=TRUE)`, we could consider much more, e.g., using `"\s"`, see
However, the help page ?trimws contains
For portability, ‘whitespace’ is taken as the character class
‘[ \t\r\n]’ (space, horizontal tab, line feed, carriage return).
so there was the idea that "white space" should not depend on the locale etc.
I propose we add an argument whitespace = "[ \t\r\n]"
to the function, so we'd remain backward compatible
and allow more extended definitions of whitespace.
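To make the proposal concrete, here is a rough sketch of what such a function could look like. The name trimws_sketch is hypothetical and this is not the code that was committed; it only assumes the sub(..., perl = TRUE) approach mentioned earlier in the thread.

```r
# Hypothetical stand-in for trimws() with the proposed 'whitespace' argument.
trimws_sketch <- function(x, which = c("both", "left", "right"),
                          whitespace = "[ \t\r\n]") {
  which <- match.arg(which)
  # Remove the first match of regexp 're' from each element of x.
  strip <- function(re, x) sub(re, "", x, perl = TRUE)
  left  <- paste0("^", whitespace, "+")   # leading run of whitespace
  right <- paste0(whitespace, "+$")       # trailing run of whitespace
  switch(which,
         left  = strip(left, x),
         right = strip(right, x),
         both  = strip(right, strip(left, x)))
}

# Default behaviour is unchanged; callers opt in to a wider definition.
trimws_sketch("  a b\t\n")
trimws_sketch("\u00a0a b\u00a0", whitespace = "[ \t\r\n\u00a0]")
```

Keeping the old class as the default means existing callers see no change, while a caller who needs U+00A0 stripped passes the wider class explicitly.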
Yes, that sounds like a good idea. It would have saved me having to redefine the function. Perhaps the docs could remind users that there are characters that might appear as white space, e.g. \u00a0.
Kurt Hornik mentioned how to specify all horizontal + vertical white space;
also added an example.
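The thread does not quote the regexp itself. One way to cover all horizontal and vertical white space with perl = TRUE is the PCRE classes \h and \v, which is what this sketch assumes; the input string is invented for illustration and a UTF-8 session is assumed.

```r
# Hypothetical input: em space (U+2003) in front, then a non-breaking
# space (U+00A0), a tab and a newline at the end.
x <- "\u2003value\u00a0\t\n"

# \h = horizontal, \v = vertical white space in PCRE (an assumption about
# the suggested class, not a quote from the thread).
ws <- "[\\h\\v]"

# Strip the leading run first, then the trailing run.
trimmed <- sub(paste0(ws, "+$"), "",
               sub(paste0("^", ws, "+"), "", x, perl = TRUE),
               perl = TRUE)
trimmed
```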
Now committed to R-devel, svn c 74838