Bug 17280 - Wishlist: aggregate.data.frame(drop=FALSE) to include value for unused factor level
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist
Version: 3.4.0
Hardware: All All
Importance: P5 enhancement
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2017-05-27 14:59 UTC by Suharto Anggono
Modified: 2017-06-04 13:01 UTC (History)
0 users

See Also:


Attachments
Against R devel r72750 (927 bytes, patch)
2017-06-01 08:16 UTC, Suharto Anggono
Against R devel r72750, not call function on empty subsets (1.30 KB, patch)
2017-06-04 02:08 UTC, Suharto Anggono
Against R devel r72750, not call function on empty subsets (1.26 KB, patch)
2017-06-04 13:01 UTC, Suharto Anggono

Description Suharto Anggono 2017-05-27 14:59:53 UTC
Example 3 in https://stat.ethz.ch/pipermail/r-help/2016-May/438631.html , modified from http://stackoverflow.com/questions/22523131/dplyr-summarise-equivalent-of-drop-false-to-keep-groups-with-zero-length-in :
> DF <- data.frame(a=rep(1:3,4), b=factor(rep(1:2,6), levels=1:3))
> aggregate(DF["a"], DF["b"], length, drop=FALSE)
  b a
1 1 6
2 2 6

The result has no row for level "3" of 'b', which never appears in the data.

This behavior is present in R 3.3.0 and is still present in R 3.4.0.

I expect drop=FALSE to mean that everything is retained, as in 'interaction' with drop=FALSE. That is, I expect the result of 'aggregate.data.frame' with drop=FALSE to have a row for every level of a factor grouping variable. For the example, I expect the result to have a row for "3" in 'b'.
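For comparison, here is a small sketch of how 'interaction' and 'tapply' treat the unused level in the same data. It only illustrates the expectation stated above; the variable names 'kept', 'dropped' and 'counts' are mine, not from the report or the patches.

```r
# Same data as in the report: level "3" of 'b' is declared but never used.
DF <- data.frame(a = rep(1:3, 4), b = factor(rep(1:2, 6), levels = 1:3))

# interaction() keeps unused levels with drop = FALSE
# and removes them with drop = TRUE.
kept    <- levels(interaction(DF$b, drop = FALSE))
dropped <- levels(interaction(DF$b, drop = TRUE))

# tapply() returns one cell per level, including the unused one.
counts <- tapply(DF$a, DF$b, length)

print(kept)
print(dropped)
print(counts)
```

Here 'kept' and 'counts' both have an entry for level "3", which is the shape I would expect from aggregate(..., drop=FALSE).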

Example 2 in https://stat.ethz.ch/pipermail/r-help/2016-May/438631.html , modified from "Compute the averages according to region and the occurrence of more than 130 days of frost" in "Examples" in R help on 'aggregate':
> aggregate(state.x77,
+           list(Region = state.region,
+                Cold = state.x77[,"Frost"] > 130),
+           mean, drop = FALSE)
         Region  Cold Population   Income Illiteracy Life Exp    Murder
1     Northeast FALSE  8802.8000 4780.400  1.1800000 71.12800  5.580000
2         South FALSE  4208.1250 4011.938  1.7375000 69.70625 10.581250
3 North Central FALSE  7233.8333 4633.333  0.7833333 70.95667  8.283333
4          West FALSE  4582.5714 4550.143  1.2571429 71.70000  6.828571
5     Northeast  TRUE  1360.5000 4307.500  0.7750000 71.43500  3.650000
6         South  TRUE        NaN      NaN        NaN      NaN       NaN
7 North Central  TRUE  2372.1667 4588.833  0.6166667 72.57667  2.266667
8          West  TRUE   970.1667 4880.500  0.7500000 70.69167  7.666667
   HS Grad    Frost      Area
1 52.06000 110.6000  21838.60
2 44.34375  64.6250  54605.12
3 53.36667 120.0000  56736.50
4 60.11429  51.0000  91863.71
5 56.35000 160.5000  13519.00
6      NaN      NaN       NaN
7 55.66667 157.6667  68567.50
8 64.20000 161.8333 184162.17

The result includes the combination Region="South", Cold=TRUE, which never appears in the data, as I expect. Individually, both Region="South" and Cold=TRUE appear in the data.

Another issue is that, in the example, function 'mean' is also applied to the subset corresponding to Region="South" and Cold=TRUE, which has zero rows, giving NaN. I don't have a strong opinion on this. It is a fine choice, but it differs from 'tapply', which does not call the function on empty cells.
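The difference from 'tapply' can be seen with one column of the same data. This is only an illustrative sketch; 'cold' and 'res' are names I introduce here.

```r
# Same grouping as in the aggregate() example, applied to one column.
cold <- state.x77[, "Frost"] > 130
res <- tapply(state.x77[, "Population"],
              list(Region = state.region, Cold = cold),
              mean)

# The empty cell (South, TRUE) is filled with NA; 'mean' is not called on it.
print(res)

# Calling 'mean' on a zero-length numeric vector is what produces NaN.
print(mean(numeric(0)))
```

So 'tapply' leaves the empty cell as NA, while 'aggregate.data.frame' with drop=FALSE currently evaluates the function on the empty subset.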
Comment 1 Suharto Anggono 2017-06-01 08:16:09 UTC
Created attachment 2257
Against R devel r72750
Comment 2 Suharto Anggono 2017-06-04 02:08:24 UTC
Created attachment 2258
Against R devel r72750, not call function on empty subsets
Comment 3 Suharto Anggono 2017-06-04 13:01:48 UTC
Created attachment 2259
Against R devel r72750, not call function on empty subsets

This is correct and looks simpler.
When length(sort(unique(grp))) == length(lev), sort(unique(grp)) and 'lev' may still differ if there is a malformed factor.
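A hypothetical illustration of that last point follows. The patch is not reproduced in this thread, so 'codes' and 'lev' below are stand-ins of my own (integer codes present in the data, and the full set of valid codes), not the actual variables in 'aggregate.data.frame'.

```r
# A deliberately malformed factor: integer code 3 has no corresponding level.
f <- structure(c(1L, 3L), levels = c("a", "b"), class = "factor")

# Stand-ins (assumptions): codes actually present vs. codes the levels allow.
codes <- sort(unique(as.integer(f)))
lev   <- seq_len(nlevels(f))

# The lengths agree, but the contents do not.
print(length(codes) == length(lev))
print(identical(codes, lev))
```

Comparing lengths alone would therefore not detect this case.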