Bug 16337 - Add data.frame support to sample()
Summary: Add data.frame support to sample()
Status: NEW
Alias: None
Product: R
Classification: Unclassified
Component: Misc (show other bugs)
Version: R-devel (trunk)
Hardware: Other Linux
: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2015-04-26 10:14 UTC by Oliver Keyes
Modified: 2015-05-01 12:30 UTC (History)
1 user (show)

See Also:


Attachments
Suggested patch (2.31 KB, patch)
2015-04-26 10:14 UTC, Oliver Keyes
Details | Diff
Amended patch (2.50 KB, patch)
2015-05-01 12:30 UTC, Oliver Keyes
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Oliver Keyes 2015-04-26 10:14:56 UTC
Created attachment 1804 [details]
Suggested patch

One of the most commonly used R datatypes is the (non-primitive, but powerful) data.frame. They're probably the most widely used datatype, in practice, outside of atomic vectors.

Despite this, sample() does not have support for row-based sampling of data.frames; if you submit a data.frame, it treats it as it would a list (because a data.frame /is/ a list) and returns a random selection from within one column. The result is that every R programmer I know has at some point written something like:

     df[sample(1:nrow(df),size),]

Usually multiple times.

This .diff adds support for row-based sampling of data.frames, hopefully without breaking anything, while maintaining the code conventions used in the function. It also updates the documentation to reflect this, and (as miscellaneous tweaks) ups the copyright year and notes that sample() accepts lists - while lists are recursive vectors, many people use 'vector' as a generic term for atomic vectors.
Comment 1 Duncan Murdoch 2015-05-01 11:53:48 UTC
This seems like it might be a useful addition, but your patch doesn't handle the case of missing "size" properly.  And it doesn't sample rows from arrays, and since sample() isn't a generic, we don't have methods for other complex types. 
It also makes it harder to sample columns of a dataframe (though I don't know if anyone actually does that).

I think the "right" way to do this would be to make sample() into a generic, with a sample.data.frame method that allowed sampling by row or column, but that would impose costs on all current uses, so it's not perfect either.

Overall, I'm not so sure that this is worth doing.
Comment 2 Oliver Keyes 2015-05-01 12:25:28 UTC
Hey Duncan!

Thanks for the review :)

(In reply to Duncan Murdoch from comment #1)
> This seems like it might be a useful addition, but your patch doesn't handle
> the case of missing "size" properly.

Good spot with the size() issue - I'll patch that now and throw in an updated patch off trunk. 

> And it doesn't sample rows from
> arrays, and since sample() isn't a generic, we don't have methods for other
> complex types.

It doesn't, but the same is true of the existing implementation (try sample(array(1:3, c(2,4)), 1) for example); this is a patch for a specific use case rather than a total rewrite. 

> It also makes it harder to sample columns of a dataframe (though I don't
> know if anyone actually does that).

I've not ever seen that done, either. What I do see is people needing to sample rows, though, fairly regularly. The use case tends to be: data stored in data.frames trends towards rows for entries and columns for values, which means that the common operation of "randomly sampling from your dataset" would be sampling by row. Sampling by column instead causes confusion at the user end and makes it harder for base R functions to be used to achieve the desired goal.

> 
> I think the "right" way to do this would be to make sample() into a generic,
> with a sample.data.frame method that allowed sampling by row or column, but
> that would impose costs on all current uses, so it's not perfect either.
> 

Agreed; I considered doing that but discounted it for precisely that reason - that S3 lookups tend to be very slow. That's why I chose to patch for this specific (common) use case instead.

> Overall, I'm not so sure that this is worth doing.

There we'll disagree, but I'll amend the patch so that it doesn't actively break things and hopefully the use case will have convinced you by the time I come back ;).
Comment 3 Oliver Keyes 2015-05-01 12:30:34 UTC
Created attachment 1816 [details]
Amended patch

Amended patch to handle missing size values correctly in the case of data.frames (thanks again, Duncan)