Bug 17117 - %in% and match do not find two equal UTF-8 strings, but identical() says they are the same
Status: CLOSED DUPLICATE of bug 16885
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.3.*
Hardware: x86_64/x64/amd64 (64-bit) Linux
Importance: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2016-07-12 12:39 UTC by Emmanuel CURIS
Modified: 2016-07-12 14:15 UTC

Attachments
The CSV file to reproduce the problem (36 bytes, text/csv)
2016-07-12 12:39 UTC, Emmanuel CURIS

Description Emmanuel CURIS 2016-07-12 12:39:46 UTC
Created attachment 2125
The CSV file to reproduce the problem

This bug is difficult to reproduce and to describe; it is probably caused by a UTF-8 encoding subtlety. See the example below, which uses the attached CSV file.

## From UTF-8 csv file
d <- read.table( "essai_match.csv", sep = ";", header = TRUE )
( x <- names( d ) )

y.N <- grep( "\\.N$", x, value = TRUE )
y.D <- gsub( "\\.N$", "\\.D", y.N )

## All 3 are TRUE and match gives 1 as expected
y.N == x[ 1 ] ; identical( y.N, x[ 1 ] )
match( y.N, x[ 1 ] ) ; y.N %in% x[ 1 ]

## The two first are TRUE, but the two last are NA and FALSE !
y.D == x[ 2 ] ; identical( y.D, x[ 2 ] )
match( y.D, x[ 2 ] ) ; y.D %in% x[ 2 ]

## The same with hard-coded strings
x <- c( "Vitalité.....N", "Vitalité.....D" )

y.N <- grep( "\\.N$", x, value = TRUE )
y.D <- gsub( "\\.N$", "\\.D", y.N )

## All 3 are TRUE and match gives 1 as expected
y.N == x[ 1 ] ; identical( y.N, x[ 1 ] )
match( y.N, x[ 1 ] ) ; y.N %in% x[ 1 ]

## All 3 are TRUE and match gives 1 as expected
y.D == x[ 2 ] ; identical( y.D, x[ 2 ] )
match( y.D, x[ 2 ] ) ; y.D %in% x[ 2 ]

In short: take a vector of two similar strings, X.N and X.D. Use gsub to turn X.N into X.D, then compare the results of identical and %in%. If X contains UTF-8 encoded accents, identical correctly reports that X.D and gsub( "\\.N$", "\\.D", "X.N" ) are equal, but %in% and match fail when the strings are column names read from a UTF-8 CSV file. It works, however, when the strings are hard-coded in the R script.

It seems strange that identical says the two are equal but match does not find them!

version : R 3.3.0
OS : Linux Mageia 5, 64 bits
Comment 1 Emmanuel CURIS 2016-07-12 12:47:11 UTC
Additional tests: the problem seems to lie somewhere in make.names.
Try appending the following lines to the end of the previous script:

x2 <- make.names( x )

y.N <- grep( "\\.N$", x2, value = TRUE )
y.D <- gsub( "\\.N$", "\\.D", y.N )

## All 3 are TRUE and match gives 1 as expected
y.N == x2[ 1 ] ; identical( y.N, x2[ 1 ] )
match( y.N, x2[ 1 ] ) ; y.N %in% x2[ 1 ]

## The two first are TRUE, but the two last are NA and FALSE !
y.D == x2[ 2 ] ; identical( y.D, x2[ 2 ] )
match( y.D, x2[ 2 ] ) ; y.D %in% x2[ 2 ]
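
To narrow down where the two strings differ, it may help to look at how each one is stored rather than only at how it prints. The sketch below is only a diagnostic idea: describe_string is a hypothetical helper (not part of R) that reports the declared encoding and the raw bytes of a single string. It makes no claim about what make.names or gsub actually return on a given R version or locale.

```r
## Hypothetical helper (not part of R): summarise how one string is stored.
describe_string <- function(s) {
  stopifnot(is.character(s), length(s) == 1L)
  list(
    encoding = Encoding(s),                          # "unknown", "UTF-8", "latin1" or "bytes"
    bytes    = paste(charToRaw(s), collapse = " ")   # stored bytes, in hexadecimal
  )
}

## Same construction as in the script above, with the accent written as an escape
x  <- c("Vitalit\u00e9.....N", "Vitalit\u00e9.....D")
x2 <- make.names(x)

y.N <- grep("\\.N$", x2, value = TRUE)
y.D <- gsub("\\.N$", "\\.D", y.N)

## Compare the stored form of the two strings that match() fails to pair up
str(describe_string(x2[2]))
str(describe_string(y.D))
```

If the bytes agree but the encoding fields differ, the two strings differ only in their declared encoding, which is something == and identical() are documented to tolerate.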
Comment 2 Martin Maechler 2016-07-12 13:32:38 UTC
Can you please try with R 3.3.1 (or newer)?
I'm almost sure (though I have not taken the time to check with your *.csv etc.) that this is really bug 16885, which was fixed a while ago:


From NEWS (of R 3.3.1), the 2nd entry in BUG FIXES is

    • match(x, t) (and hence x %in% t) failed when x was of length one,
      and either character and x and t only differed in their Encoding
      or when x and t were complex with NAs or NaNs.  (PR#16885.)
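
A minimal sketch of the situation the NEWS entry describes, assuming one simple way to obtain such a pair: the same text stored once as UTF-8 and once as latin1. This is an illustration only; the strings in the report may differ in their encoding marks in another way (for example "unknown" versus "UTF-8"), and the behaviour on R 3.3.0 is taken from the NEWS entry rather than re-tested here.

```r
## Same text, two declared encodings (an assumed way to build such a pair)
a <- "Vitalit\u00e9.....D"                       # \u escape: marked as UTF-8
b <- iconv(a, from = "UTF-8", to = "latin1")     # same text, marked as latin1

Encoding(a)
Encoding(b)

## Both comparisons treat the strings as equal despite the different marks
a == b
identical(a, b)

## Length-one x: the case named in the NEWS entry (PR#16885)
match(a, b)
a %in% b

## On R 3.3.1 or newer the lookup is expected to succeed
if (getRversion() >= "3.3.1") stopifnot(match(a, b) == 1L, a %in% b)

## Untested assumption for older versions: converting both sides to UTF-8
## with enc2utf8() first removes the encoding difference before the lookup
match(enc2utf8(a), enc2utf8(b))
```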
Comment 3 Emmanuel CURIS 2016-07-12 14:15:36 UTC
Indeed, it works with R 3.3.1. Sorry for the disturbance.

*** This bug has been marked as a duplicate of bug 16885 ***