Bug 15952 - Problem with sort? inconsistent results.
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: Misc
Version: R 3.1.0
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance: P5 major
Assignee: R-core
Reported: 2014-08-27 20:05 UTC by felasa
Modified: 2014-08-28 15:15 UTC

unsortable character vectors (2.38 KB, application/x-gzip)
2014-08-27 20:05 UTC, felasa

Description felasa 2014-08-27 20:05:32 UTC
Created attachment 1652
unsortable character vectors

I am getting unexpected behavior from sort() and order().

I have two character vectors, char_1 and char_2, both of length 329.

When I try to put them in the same order I get the following (output is given as a comment):

length(unique(char_1))  # 329
length(unique(char_2))  # 329

identical(char_1, char_2)  # FALSE
sum(char_1 %in% char_2)    # 329
sum(char_2 %in% char_1)    # 329

s1 <- sort(char_1)
# These vectors (supposedly) come from a previous sorting, but:
identical(char_1, s1)  # FALSE

s2 <- sort(char_2)
identical(char_2, s2)  # FALSE

identical(s1, s2)  # FALSE (!)

# Moreover ...
identical(sort(s1), s1)                    # FALSE (!!)
identical(sort(sort(s1)), sort(s1))        # FALSE
identical(sort(sort(sort(s1))), sort(s1))  # TRUE

So it seems sort() cannot sort them unambiguously. Is this in any way expected? Using order() gives similar results.

Attached is an .RData file with the character vectors in question.

Output of sessionInfo():
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252    LC_MONETARY=Spanish_Mexico.1252
[4] LC_NUMERIC=C                    LC_TIME=Spanish_Mexico.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.0
Comment 1 Simon Urbanek 2014-08-27 21:32:41 UTC
Collation order and uniqueness are not necessarily consistent with each other, and the collation order depends on your locale. Since you're not using a stable sort, elements that compare as equal in the collation order of your locale, but are not identical values, will end up in (semi)random order.
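
One way to check whether the locale's collation is responsible is to repeat the sort with LC_COLLATE temporarily set to "C", where strings are compared roughly by byte value rather than by language rules. The following is only a sketch: the vector x is a made-up stand-in that reuses a few strings of the kind reported here (it is not the attached data), and the names old_collate, s_c and idempotent are illustrative.

```r
# Assumption: x is a small stand-in for char_1, not the attached data.
x <- c("codul", "\u00e7odul", "nause", "n\u00e4use", "ulcer", "\u00fclcer")

old_collate <- Sys.getlocale("LC_COLLATE")   # remember the current setting
invisible(Sys.setlocale("LC_COLLATE", "C"))  # switch off language-specific collation

s_c <- sort(x)
idempotent <- identical(sort(s_c), s_c)      # does re-sorting leave the result unchanged?

invisible(Sys.setlocale("LC_COLLATE", old_collate))  # restore the original setting
print(idempotent)
```

If re-sorting is stable under "C" but not under Spanish_Mexico.1252, that would point to ties in the locale's collation rather than to sort() itself.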
Comment 2 Peter Dalgaard 2014-08-27 21:50:12 UTC
It is expected if collation is ambiguous, as it is in some locales. However, I do not have Spanish_Mexico.1252 to hand and I can't reproduce it with es_ES.UTF-8.

You need to dig deeper and find which strings are changing place, then see how they compare. Something along the lines of:

s1 <- sort(char_1)
all(s1[-1] > s1[-329])
which(!(s1[-1] > s1[-329]))

ss1 <- sort(s1)
identical(s1, ss1)
which(s1 != sort(s1))
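
The same check can be wrapped in a small function that does not hard-code the length 329 and that reports all three comparisons for each suspicious pair. This is a sketch only: adjacent_ties is a hypothetical helper, not part of base R, and it relies on the assumption that < and > follow the locale's collation while == tests whether the strings are the same.

```r
# Hypothetical helper (not part of base R): after sorting, list every pair of
# neighbours that is NOT in strictly increasing order, together with the
# result of each comparison operator on that pair.
adjacent_ties <- function(x) {
  s <- sort(x)
  n <- length(s)
  left  <- s[-n]    # elements 1 .. n-1
  right <- s[-1L]   # elements 2 .. n
  bad <- which(!(left < right))
  data.frame(
    pos     = bad,
    left    = left[bad],
    right   = right[bad],
    less    = left[bad] <  right[bad],
    greater = left[bad] >  right[bad],
    same    = left[bad] == right[bad],
    stringsAsFactors = FALSE
  )
}

# A genuine duplicate shows up with same == TRUE:
adjacent_ties(c("b", "a", "c", "a"))
```

Applied to char_1, rows where less, greater and same are all FALSE would be pairs that the locale cannot order even though the strings differ.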
Comment 3 felasa 2014-08-28 15:15:47 UTC
Strings that change place:


which(!(s1[-1] > s1[-329])) -> ind


c("adelante”", "paliativos", "subsecuentedoñasdoñ")

#And after:

which(s1 != sort(s1)) -> ind2

c("caas", "caastañedacrismatt", "cçdej", "ccde", "çodul", "codul", 
"conmanej", "conmapñai", "conm", "conmal", "estre", "estrech", 
"estrechabirads", "estrechamnet", "estrechap", "estrector", "estregen", 
"estreimient", "estreiñ", "näuse", "nause", "paliativosactual", 
"paliativosconsult", "paliativosh", "paliativoshay", "paliativosp", 
"paliativossdolor", "paliativosseñal", "paliativos", "paliativosasintomatica", 
"paliativoscuers", "paliativosna", "pequ", "pequen", "pequeñ", 
"pequeñacon", "pequeñans", "pequeñit", "pequeñom", "pequeñomedian", 
"pequeñosmedian", "pequieñ", "tanmañ", "tanm", "ulcer", "ülcer"