Bug 16728 - gsub does not handle properly new lines in regular expressions
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.2.3
Hardware: x86_64/x64/amd64 (64-bit) Linux
Importance: P5 major
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-02-26 07:58 UTC by Luca Cerone
Modified: 2016-02-26 09:58 UTC

Description Luca Cerone 2016-02-26 07:58:27 UTC
Hi, I think there is a bug in the way gsub() finds the end of line.

For example:

x <- "   this is \n    some test\n text    "
#I want to remove all leading spaces at the beginning of lines
gsub("^\\s+","",x) 
[1] "this is \n    some test\n text    "
I get the same result with gsub("^\\s+", "",x, perl=TRUE)

gsub correctly replaced the first occurrence, but it seems not to have recognized the newline character, so the subsequent occurrences are not substituted.

The expected output is:
"this is \nsome test\ntext    "

Similarly, if I try to remove all the whitespace at the end of each line:
gsub("\\s+$","",x)
[1] "   this is \n    some test\n text"

whereas I would have expected:
"   this is\n    some test\n text"

Note: I have also tried using stringr::str_replace_all(), but the results are the same.
Comment 1 Mikko Korpela 2016-02-26 09:58:11 UTC
This is because the circumflex "^" is an anchor to the beginning of the whole string, not to any position after a newline.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_09

I think the mixed use of the words "line" and "string" in the standard reflects the fact that pattern matching is often performed one line at a time, i.e. after multiline input has been split into separate strings at line breaks.
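
To illustrate the line-at-a-time approach, here is a minimal sketch that splits the input at line breaks, strips each piece separately, and joins the pieces again. The variable names are only illustrative, and the sketch assumes that "\n" is the only line separator in the input.

```r
x <- "   this is \n    some test\n text    "

# Split at line breaks so that each line becomes its own string.
lines <- strsplit(x, "\n", fixed = TRUE)[[1]]

# "^" now anchors at the start of every element, i.e. of every line.
stripped <- sub("^\\s+", "", lines)

# Re-join the lines with the original separator.
result <- paste(stripped, collapse = "\n")
```

Note that strsplit() does not return a trailing empty string, so a newline at the very end of the input would be lost by this round trip.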

The fine Regex Tutorial at http://www.regular-expressions.info/anchors.html states that some tools do treat the "^" as an anchor to the start of each line in the input.

You could try the following substitution (lookbehind requires perl=TRUE):

gsub("^\\s+|(?<=\n)\\s+", "", x, perl=TRUE)
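
As an alternative sketch, the PCRE inline option (?m) switches on multiline mode, in which "^" and "$" also match directly after and before each newline. This relies on perl=TRUE. The character class [ \t] is my own choice here rather than part of the suggestion above: it is used instead of \s because \s also matches "\n" and could therefore swallow empty lines.

```r
x <- "   this is \n    some test\n text    "

# Multiline mode: "^" matches at the start of the string and after each "\n".
no_leading <- gsub("(?m)^[ \\t]+", "", x, perl = TRUE)

# Multiline mode: "$" matches before each "\n" and at the end of the string.
no_trailing <- gsub("(?m)[ \\t]+$", "", x, perl = TRUE)
```

With the example input, no_leading and no_trailing should correspond to the two expected outputs given in the description.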