There appears to be a bug in summary.data.frame() in the case where a data frame contains Date class columns that contain NA's and other columns, if present, do not.

Links to R-Help and R-Devel discussions:

  https://stat.ethz.ch/pipermail/r-help/2016-February/435992.html
  https://stat.ethz.ch/pipermail/r-devel/2016-February/072302.html

Example:

  x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
  x.Date <- as.Date(as.character(x), format = "%Y%m%d")
  DF.Dates <- data.frame(Col1 = x.Date)

  summary(x.Date)
          Min.      1st Qu.       Median         Mean      3rd Qu.
  "1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17"
          Max.         NA's
  "1969-12-28"          "3"

  # NA's missing from output
  summary(DF.Dates)
        Col1
   Min.   :1881-09-24
   1st Qu.:1902-12-04
   Median :1920-09-10
   Mean   :1923-04-12
   3rd Qu.:1941-01-17
   Max.   :1969-12-28

  DF.Dates$x1 <- 1:7

  DF.Dates
          Col1 x1
  1       <NA>  1
  2 1881-09-24  2
  3 1909-12-27  3
  4       <NA>  4
  5 1931-05-26  5
  6 1969-12-28  6
  7       <NA>  7

  # NA's still missing
  summary(DF.Dates)
        Col1                  x1
   Min.   :1881-09-24   Min.   :1.0
   1st Qu.:1902-12-04   1st Qu.:2.5
   Median :1920-09-10   Median :4.0
   Mean   :1923-04-12   Mean   :4.0
   3rd Qu.:1941-01-17   3rd Qu.:5.5
   Max.   :1969-12-28   Max.   :7.0

  DF.Dates$x2 <- c(1:6, NA)

  # NA's show if another column has any
  summary(DF.Dates)
        Col1                  x1            x2
   Min.   :1881-09-24   Min.   :1.0   Min.   :1.00
   1st Qu.:1902-12-04   1st Qu.:2.5   1st Qu.:2.25
   Median :1920-09-10   Median :4.0   Median :3.50
   Mean   :1923-04-12   Mean   :4.0   Mean   :3.50
   3rd Qu.:1941-01-17   3rd Qu.:5.5   3rd Qu.:4.75
   Max.   :1969-12-28   Max.   :7.0   Max.   :6.00
   NA's   :3                          NA's   :1

The issue appears to occur as a result of the following code in summary.data.frame():

  nr <- if (nv) max(unlist(lapply(z, NROW))) else 0

In the case of Date class objects, summary.Date() tracks the counts of NAs in an attribute called "NAs":

  x <- summary.default(unclass(object), digits = digits, ...)
  if (m <- match("NA's", names(x), 0)) {
      NAs <- as.integer(x[m])
      x <- x[-m]
      attr(x, "NAs") <- NAs
  }

It should be noted that summary.POSIXct() has the same code.
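To make the effect of that snippet concrete, here is a minimal sketch that applies the same steps to a hand-built named vector. The vector is only a stand-in for what summary.default() would return for a column with three NAs; its values and the names z, length_before, length_after and na_count are made up for illustration and do not come from the R sources.

```r
# Hypothetical stand-in for a summary.default() result with an "NA's" element
x <- c(Min. = 1, "1st Qu." = 2, Median = 3, Mean = 3,
       "3rd Qu." = 4, Max. = 5, "NA's" = 3)
length_before <- length(x)   # the NA count is still an ordinary element

# Same steps as the summary.Date() code quoted above:
# move the NA count out of the vector and into an attribute
if (m <- match("NA's", names(x), 0)) {
  NAs <- as.integer(x[m])
  x <- x[-m]
  attr(x, "NAs") <- NAs
}
length_after <- NROW(x)      # one element shorter than before
na_count <- attr(x, "NAs")   # the count survives only as an attribute

# A list of per-column summaries in which no column keeps an "NA's" element
z <- list(Col1 = x, x1 = c(Min. = 1, "1st Qu." = 2.5, Median = 4,
                           Mean = 4, "3rd Qu." = 5.5, Max. = 7))
nr <- max(unlist(lapply(z, NROW)))
```

In this sketch NROW() does not see the attribute, so the row count computed from z has no slot for the NA count of Col1.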
As a result, the value of 'nr' above is only 6, rather than 7, which in effect truncates the printed result from summary.data.frame() via print.summaryDefault().

In the case of numeric vectors, for example, 'nr' would be 7, because the NA count is stored in a vector element rather than in an attribute. This is due to different NA handling code in summary.default(). Hence, when another non-Date column contains NA's, the NA count for the Date column is included in the output, but not otherwise.

A possible solution to this issue may be as simple as a modification to the code in summary.data.frame() that creates 'nr' above, to take into account the presence or absence of the "NAs" attribute set for Date and POSIXct class objects by the two functions:

  nr <- if (nv) max(unlist(lapply(z, function(x) NROW(x) + !is.null(attr(x, "NAs"))))) else 0

so that 'nr' would be 7, rather than 6, when NAs are present.

Some testing suggests that this resolves the problem. However, I acknowledge that there may be other considerations, given some of the inter-dependencies across these summary methods.

Thanks.

Marc
Thank you, Marc. Your diagnosis and proposed fix seem very reasonable and appropriate, at least from just reading it. I'll look into checking the fix, with the goal that it can even make it into R 3.2.4.
Hi Martin, Thanks for your follow up on this. If the proposed fix is deemed to be correct, I can certainly generate a patch for you against the requisite code base(s), if it would help. Let me know. Regards, Marc
Committed fix as r70244 several hours ago. Will be ported to R-patched later.