Bug 16885 - bug in match (I think it's related to PR#16491)
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.3.0
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 major
Assignee: R-core
Duplicates: 16909 17117
Reported: 2016-05-06 02:57 UTC by Xianying Tan
Modified: 2016-07-12 14:15 UTC

Attachments
patch and regression tests (1.08 KB, patch)
2016-05-06 23:27 UTC, Peter Haverty
HTML showing speed with patch in place (15.43 KB, text/html)
2016-05-06 23:27 UTC, Peter Haverty
Rmd showing speed of patch in place (1.35 KB, text/plain)
2016-05-06 23:28 UTC, Peter Haverty

Description Xianying Tan 2016-05-06 02:57:34 UTC
### Description

For non-ASCII characters, `match(x, table)` behaves differently depending on whether `x` has length 1 or length greater than 1, when `table` contains escaped Unicode. This occurs only in R 3.3.0. Note that it was tested only on a Windows 7 x64 machine; I am not sure whether it occurs on other platforms.

### Reproducible Example

#### In R 3.3.0
```r
> tmp <- "年付"
> tmp2 <- stringi::stri_escape_unicode(tmp)
> cat(tmp2)
\u5e74\u4ed8
> tmp2 <- "\u5e74\u4ed8"
> tmp %in% tmp2
[1] FALSE
> c(tmp, "a") %in% tmp2
[1]  TRUE FALSE
```
#### In R 3.2.5
```r
> tmp <- "年付"
> tmp2 <- stringi::stri_escape_unicode(tmp)
> cat(tmp2)
\u5e74\u4ed8
> tmp2 <- "\u5e74\u4ed8"
> tmp %in% tmp2
[1] TRUE
> c(tmp, "a") %in% tmp2
[1]  TRUE FALSE
```
### Conclusion

As you can see, under R 3.3.0, `tmp %in% tmp2` returns `FALSE`, while `c(tmp, "a") %in% tmp2` returns `TRUE` for the same element `tmp`. I believe this is a bug introduced by PR#16491, according to the NEWS of R 3.3.0.

Thanks.
Comment 1 Martin Maechler 2016-05-06 16:37:45 UTC
Thank you for the reproducible example code.

Indeed, I cannot reproduce your problem on Fedora 22 Linux:
- in one case, in a "Latin1" (I think) setup, I saw different Unicode characters and consistently got FALSE,
- and in the regular (UTF-8) setup, I get the same two Unicode sequences as you, and then consistently TRUE (i.e., "TRUE" and "TRUE FALSE"), the same as you say you get in R 3.2.5.

We'd be happy to hear of results on other platforms.

One easy, but ugly, fix would be to *not* use the speedup code when running on Windows... but I somehow doubt that the problem shows only there. Maybe for different "wide characters" the problem could also show on other platforms?
Comment 2 Peter Haverty 2016-05-06 23:27:07 UTC
Created attachment 2079 [details]
patch and regression tests

Using "Seql" fixes matches that differ only in character encoding.
Comment 3 Peter Haverty 2016-05-06 23:27:42 UTC
Created attachment 2080 [details]
HTML showing speed with patch in place
Comment 4 Peter Haverty 2016-05-06 23:28:14 UTC
Created attachment 2081 [details]
Rmd showing speed of patch in place
Comment 5 Suharto Anggono 2016-05-06 23:39:09 UTC
Example like in ?Encoding :
x <- "fa\xEile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8")
match(x, xx)

In the usual case of 'match', string equality testing uses the function 'sequal' in unique.c.
Comment 6 Suharto Anggono 2016-05-07 00:04:51 UTC
(In reply to Suharto Anggono from comment #5)
> Example like in ?Encoding :
> x <- "fa\xEile"

Sorry, it should be the following:
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8")
match(x, xx)
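
A minimal sketch building on the example above, assuming a version of R in which this bug is fixed. It compares the length-1 call with the first element of a longer call, which is the discrepancy this report is about. The names `scalar_result` and `vector_result` are illustrative only.

```r
# Same text in two encodings, built as in the example above.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8")

# Length-1 'x': the special case discussed in this report.
scalar_result <- match(x, xx)
# Longer 'x': the usual case.
vector_result <- match(c(x, "a"), xx)

# Both cases should agree on whether 'x' is found in 'xx'.
identical(scalar_result, vector_result[1])
```

If the last expression is FALSE, the two cases disagree, as reported for R 3.3.0.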
Comment 7 Suharto Anggono 2016-05-07 01:03:06 UTC
A difference in equality testing between the special case and the usual case also occurs for complex vectors.
x <- as.complex(0/0)
y <- as.complex(NaN)
match(x, y)
match(rep(x, 2), y)
Comment 8 Suharto Anggono 2016-05-07 01:12:30 UTC
(In reply to Suharto Anggono from comment #7)
> Difference in equality testing between the special case and the usual case
> also occurs for complex vector.

A more usual example:
x <- complex(real = NaN)
y <- complex(imaginary = NaN)
match(x, y)
match(rep(x, 2), y)
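
A small sketch of how the two calls above could be compared programmatically. It makes no assumption about which answer is correct for these NaN values; it only checks that the length-1 call and the length-2 call agree. The variable names are illustrative.

```r
# Two complex values that are NaN in different parts (from the example above).
x <- complex(real = NaN)
y <- complex(imaginary = NaN)

# Length-1 call (special case) and length-2 call (usual case).
scalar_result <- match(x, y)
vector_result <- match(rep(x, 2), y)

# TRUE when both cases apply the same equality rule to 'x' and 'y'.
identical(rep(scalar_result, 2), vector_result)
```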
Comment 9 Martin Maechler 2016-05-07 06:59:58 UTC
(In reply to Suharto Anggono from comment #6)
> (In reply to Suharto Anggono from comment #5)
> > Example like in ?Encoding :
> > x <- "fa\xEile"
> 
> Sorry, it should be the following:
> x <- "fa\xE7ile"
> Encoding(x) <- "latin1"
> xx <- iconv(x, "latin1", "UTF-8")
> match(x, xx)

Thanks a lot for the simple example...
in the meantime I had also found that Encoding(.) <- ..
can be used to produce platform-independent examples.
Comment 10 Martin Maechler 2016-05-07 07:45:54 UTC
(In reply to Peter Haverty from comment #2)
> Created attachment 2079 [details]
> patch and regression tests
> 
> Using "Seql" fixes matches that differ only in character encoding.

Thank you, Pete.  The patch solves the problem.
The regression test does not trigger in R 3.3.0 in Linux.
I will use a different one.
Comment 11 Martin Maechler 2016-05-07 20:28:55 UTC
(In reply to Suharto Anggono from comment #8)
> (In reply to Suharto Anggono from comment #7)
> > Difference in equality testing between the special case and the usual case
> > also occurs for complex vector.
> 
> A more usual example:
> x <- complex(real = NaN)
> y <- complex(imaginary = NaN)
> match(x, y)
> match(rep(x, 2), y)

Indeed... this is a bit less grave than the string case, notably because it
raises quite a few issues about match / unique / ... working with complex NAs.

The documentation has always said that all complex NAs should be considered the same (but different from the non-NA NaNs), and that is also how such complex NAs are print()ed / format()ed.


I had committed a fix which provided exactly the documented behavior for match() with NAs ... but it was *not* back compatible.

I think the fix is "the right thing" and plan to commit it to R-devel...

whereas I'll port a silly back-compatibility hack (disabling the length-1 match for complex) to R "3.3.0 patched".
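
A hedged illustration of the documented behavior for complex NAs described in this comment, assuming an R version that includes the fix (R-devel at the time, to become 3.4.0). The values `na1`, `na2` and `nan1` are made up for illustration.

```r
# Two complex NAs with different non-missing parts.
na1 <- complex(real = NA, imaginary = 1)
na2 <- complex(real = 2, imaginary = NA)
# A complex NaN that is not NA.
nan1 <- complex(real = NaN, imaginary = 0)

is.na(c(na1, na2, nan1))   # is.na() is TRUE for NA and for NaN
is.nan(c(na1, na2, nan1))  # is.nan() is TRUE only for the NaN
format(c(na1, na2))        # both NAs are formatted alike

# Under the documented behavior, the two NAs should match each other,
# while an NA should not match the non-NA NaN.
match(na1, na2)
match(na1, nan1)
```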
Comment 12 Suharto Anggono 2016-05-09 08:38:47 UTC
I believe that the example provided by the reporter works in a Chinese locale that doesn't use UTF-8.

This is my experiment in http://www.tutorialspoint.com/r_terminal_online.php.

> Sys.setlocale("LC_CTYPE", "zh_CN.gbk")                            
[1] "zh_CN.gbk"                                                     
> tmp2 <- "\u5e74\u4ed8"                                            
> Encoding(tmp2)                                                    
[1] "UTF-8"                                                         
> tmp <- iconv(tmp2, "UTF-8")                                       
> Encoding(tmp)                                                     
[1] "unknown"                                                       
> tmp == tmp2                                                       
[1] TRUE                                                            
> sessionInfo()                                                     
R version 3.2.3 (2015-12-10)                                        
Platform: x86_64-redhat-linux-gnu (64-bit)                          
Running under: Fedora 23 (Twenty Three)                             
                                                                    
locale:                                                             
 [1] LC_CTYPE=zh_CN.gbk  LC_NUMERIC=C        LC_TIME=C              
 [4] LC_COLLATE=C        LC_MONETARY=C       LC_MESSAGES=C          
 [7] LC_PAPER=C          LC_NAME=C           LC_ADDRESS=C           
[10] LC_TELEPHONE=C      LC_MEASUREMENT=C    LC_IDENTIFICATION=C    
                                                                    
attached base packages:                                             
[1] stats     graphics  grDevices utils     datasets  methods   base
                                                                    

On Windows, one can precede this with Sys.setlocale("LC_CTYPE", "Chinese"), if it succeeds.
Comment 13 Xianying Tan 2016-05-10 05:18:36 UTC
(In reply to Suharto Anggono from comment #12)
> I believe that the example provided by the reporter works in a Chinese
> locale that doesn't use UTF-8.
> 
> This is my experiment in http://www.tutorialspoint.com/r_terminal_online.php.
> 
> > Sys.setlocale("LC_CTYPE", "zh_CN.gbk")                            
> [1] "zh_CN.gbk"                                                     
> > tmp2 <- "\u5e74\u4ed8"                                            
> > Encoding(tmp2)                                                    
> [1] "UTF-8"                                                         
> > tmp <- iconv(tmp2, "UTF-8")                                       
> > Encoding(tmp)                                                     
> [1] "unknown"                                                       
> > tmp == tmp2                                                       
> [1] TRUE                                                            
> > sessionInfo()                                                     
> R version 3.2.3 (2015-12-10)                                        
> Platform: x86_64-redhat-linux-gnu (64-bit)                          
> Running under: Fedora 23 (Twenty Three)                             
>                                                                     
> locale:                                                             
>  [1] LC_CTYPE=zh_CN.gbk  LC_NUMERIC=C        LC_TIME=C              
>  [4] LC_COLLATE=C        LC_MONETARY=C       LC_MESSAGES=C          
>  [7] LC_PAPER=C          LC_NAME=C           LC_ADDRESS=C           
> [10] LC_TELEPHONE=C      LC_MEASUREMENT=C    LC_IDENTIFICATION=C    
>                                                                     
> attached base packages:                                             
> [1] stats     graphics  grDevices utils     datasets  methods   base
>                                                                     
> 
> On Windows, one can precede by Sys.setlocale("LC_CTYPE", "Chinese"), if it
> succeeds.

Yes, my locale is not UTF-8 under Windows, because changing the locale setting to UTF-8 on Windows would lead to more headaches...

Below is my sessionInfo for now, if it helps:

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.2.5

Let me know if there's anything I can help with. Thanks!
Comment 14 Martin Maechler 2016-05-11 15:36:25 UTC
....
> Let me know if there's anything that I can help. Thanks!

There's no further need for help, as I have committed fixes for both problems mentioned in this report (strings with encodings // complex with NA / NaN).

Note that the latter led to my rediscovering a long-standing inconsistency, or rather bug, and R-devel (to be 3.4.0), and maybe even earlier future versions of R, will behave differently in these complex NA/NaN cases.
Comment 15 Martin Maechler 2016-05-17 08:35:10 UTC
*** Bug 16909 has been marked as a duplicate of this bug. ***
Comment 16 krzysztofpankow1 2016-06-09 12:39:56 UTC
(In reply to Martin Maechler from comment #14)
> ....
> > Let me know if there's anything that I can help. Thanks!
> 
> There's no further need for help, as I have committed fixes to both problems
> mentioned in this report (string w encodings // complex with NA / NaN).
> 
> Note that the latter led to my rediscovering a long-standing
> inconsistency, or rather bug, and R-devel (to be 3.4.0), and maybe even
> earlier future versions of R, will behave differently in these complex
> NA/NaN cases.

Hello,
I don't know if my bug belongs to either of the two groups mentioned above; here it is:

match(x="é",table = c("é")) gives 0. 

The problem occurs for all French accented characters.
Best regards,
Krzysztof
Comment 17 Michael Lawrence 2016-06-09 12:47:19 UTC
Yes, that is the same issue and it has been fixed in R patched.
Comment 18 Emmanuel CURIS 2016-07-12 14:15:36 UTC
*** Bug 17117 has been marked as a duplicate of this bug. ***