Bug 16598 - inconsistency in sample() with particular probability vector and 0 probability weights
Summary: inconsistency in sample() with particular probability vector and 0 probabilit...
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Accuracy (show other bugs)
Version: R 3.2.2
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
: P5 normal
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-11-11 08:39 UTC by Andreas Kersting
Modified: 2015-11-12 10:30 UTC (History)
1 user (show)

See Also:


Attachments
probability vector p (103 bytes, application/x-gzip)
2015-11-11 08:39 UTC, Andreas Kersting
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andreas Kersting 2015-11-11 08:39:59 UTC
Created attachment 1933 [details]
probability vector p

Although it does not say explicitly in the documentation, the following code will (generally) return true:

a <- b <- rep(NA_integer_, 1e4)
set.seed(123)
for (i in 1:1e4) {
  this.p <- runif(6)
  this.seed <- .Random.seed
  a[i] <- sample(1:6, 1, p=this.p)
  .Random.seed <<- this.seed
  b[i] <- sample(c(rep(0, i), 1:6), 1, p=c(rep(0, i), this.p))
}
all(a==b)

i.e. the result of sample does not change if we add elements to x which have a probability weight of 0. However, I came across one particular probability vector p = (0.17942180, 0.12002511, 0.45462425, 0.07368815, 0.05221558, 0.12002511) [see attached probs.Rdata] for which this is not (always) true:

load("probs.Rdata")
a <- b <- rep(NA_integer_, 1e4)
set.seed(123)
for (i in 1:1e4) {
  this.seed <- .Random.seed
  a[i] <- sample(1:6, 1, p=p)
  .Random.seed <<- this.seed
  b[i] <- sample(c(rep(0, i), 1:6), 1, p=c(rep(0, i), p))
}
all(a==b)

which returns false. If we inspect the logical vector a==b more closely, we find huge blocks of TRUE:

all((a==b)[8192:1e4])
all((a==b)[2048:4078])

but we also find blocks with a lot of FALSE.
Comment 1 Mikko Korpela 2015-11-12 10:30:58 UTC
It appears that this happens because a sort operation performed by sample() is not stable, i.e. does not preserve the order of equal values, in this case equal probabilities.

If we modify the R (development version) source code as follows,

<diff>

--- src/main/random.c	(revision 69623)
+++ src/main/random.c	(working copy)
@@ -409,6 +409,11 @@
     /* Sort probabilities into descending order */
     /* Order element identities in parallel */
     revsort(p, perm, n);
+    for (i = 0; i < n; i++) {
+        if (p[i] > 0) {
+            Rprintf("%d: %f\n", perm[i], p[i]);
+        }
+    }
 
     /* Compute the sample */
     totalmass = 1;

</diff>

, we can examine how the probabilities are ordered. Then, by manually executing the first two rounds of the example loop, we see that on the second round, item 6 comes before item 2 when there are no zero probabilities, but item 8 (previously 6) comes _after_ item 4 (previously 2) when the two zero probabilities are included.

This is just an observation. I'm not saying that the sort should be stable.

> set.seed(123)
> this.seed <- .Random.seed
> sample(1:6, 1, p=p)
3: 0.454624
1: 0.179422
6: 0.120025
2: 0.120025
4: 0.073688
5: 0.052216
[1] 3
> .Random.seed <<- this.seed
> i <- 1
> sample(c(rep(0, i), 1:6), 1, p=c(rep(0, i), p))
4: 0.454624
2: 0.179422
7: 0.120025
3: 0.120025
5: 0.073688
6: 0.052216
[1] 3
> i <- 2
> this.seed <- .Random.seed
> sample(1:6, 1, p=p)
3: 0.454624
1: 0.179422
6: 0.120025
2: 0.120025
4: 0.073688
5: 0.052216
[1] 2
> .Random.seed <<- this.seed
> sample(c(rep(0, i), 1:6), 1, p=c(rep(0, i), p))
5: 0.454624
3: 0.179422
4: 0.120025
8: 0.120025
6: 0.073688
7: 0.052216
[1] 6