Bug 16709 - Apparent bug in summary.data.frame() with columns of Date class and NA's present
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Language
Version: R 3.2.3
Hardware: All All
Importance: P5 normal
Assignee: R-core
Reported: 2016-02-11 14:15 UTC by Marc Schwartz
Modified: 2016-02-29 21:51 UTC
Description Marc Schwartz 2016-02-11 14:15:00 UTC
There appears to be a bug in summary.data.frame(): when a data frame has a Date class column that contains NA's, and no other column contains NA's, the NA count for the Date column is missing from the output.

Links to R-Help and R-Devel discussions:

https://stat.ethz.ch/pipermail/r-help/2016-February/435992.html
https://stat.ethz.ch/pipermail/r-devel/2016-February/072302.html


Example:

x <- c(18000000, 18810924, 19091227, 19027233, 19310526, 19691228, NA)
x.Date <- as.Date(as.character(x), format = "%Y%m%d")

DF.Dates <- data.frame(Col1 = x.Date)

summary(x.Date)
       Min.      1st Qu.       Median         Mean      3rd Qu. 
"1881-09-24" "1902-12-04" "1920-09-10" "1923-04-12" "1941-01-17" 
       Max.         NA's 
"1969-12-28"          "3" 


# NA's missing from output
summary(DF.Dates)
     Col1           
Min.   :1881-09-24  
1st Qu.:1902-12-04  
Median :1920-09-10  
Mean   :1923-04-12  
3rd Qu.:1941-01-17  
Max.   :1969-12-28  


DF.Dates$x1 <- 1:7

DF.Dates
       Col1 x1
1       <NA>  1
2 1881-09-24  2
3 1909-12-27  3
4       <NA>  4
5 1931-05-26  5
6 1969-12-28  6
7       <NA>  7

# NA's still missing
summary(DF.Dates)
     Col1                  x1     
Min.   :1881-09-24   Min.   :1.0  
1st Qu.:1902-12-04   1st Qu.:2.5  
Median :1920-09-10   Median :4.0  
Mean   :1923-04-12   Mean   :4.0  
3rd Qu.:1941-01-17   3rd Qu.:5.5  
Max.   :1969-12-28   Max.   :7.0  


DF.Dates$x2 <- c(1:6, NA)

# NA's show if another column has any
summary(DF.Dates)
     Col1                  x1            x2      
Min.   :1881-09-24   Min.   :1.0   Min.   :1.00  
1st Qu.:1902-12-04   1st Qu.:2.5   1st Qu.:2.25  
Median :1920-09-10   Median :4.0   Median :3.50  
Mean   :1923-04-12   Mean   :4.0   Mean   :3.50  
3rd Qu.:1941-01-17   3rd Qu.:5.5   3rd Qu.:4.75  
Max.   :1969-12-28   Max.   :7.0   Max.   :6.00  
NA's   :3                          NA's   :1     


The issue appears to occur as a result of the following code in summary.data.frame():

nr <- if (nv)
          max(unlist(lapply(z, NROW)))
      else 0


In the case of Date class objects, summary.Date() tracks the counts of NAs in an attribute called "NAs":

x <- summary.default(unclass(object), digits = digits, ...)
if (m <- match("NA's", names(x), 0)) {
    NAs <- as.integer(x[m])
    x <- x[-m]
    attr(x, "NAs") <- NAs
}

It should be noted that summary.POSIXct() has the same code.
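
The effect of that block can be reproduced in isolation. The following is a minimal sketch, not the base R source: move_na_count_to_attr() is a hypothetical helper name, and unclass() is applied to the result of summary.default() (an addition for this sketch) so that plain named-vector indexing is used.

```r
# Hypothetical helper mirroring the NA handling quoted above
move_na_count_to_attr <- function(object) {
    # unclass() the result so that ordinary named-vector indexing applies
    x <- unclass(summary.default(unclass(object)))
    if (m <- match("NA's", names(x), 0L)) {
        NAs <- as.integer(x[m])
        x <- x[-m]               # drop the "NA's" element ...
        attr(x, "NAs") <- NAs    # ... and keep the count as an attribute
    }
    x
}

# The same dates as in the example data frame, with three NA's
d <- as.Date(c(NA, "1881-09-24", "1909-12-27", NA,
               "1931-05-26", "1969-12-28", NA))
s <- move_na_count_to_attr(d)

length(s)        # 6: the "NA's" element has been removed
attr(s, "NAs")   # 3: the count survives only as an attribute
```

Anything that sizes its output from the length of the summary alone will therefore not see the NA count.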

As a result, the value of 'nr' above is only 6, rather than 7, which in effect truncates the printed result from summary.data.frame() via print.summaryDefault(), so the NA's row is dropped.

In the case of numeric vectors, for example, 'nr' would be 7, because the NA count is stored in a vector element, rather than in an attribute. This is due to different NA handling code in summary.default(). Hence, when another non-Date column contains NA's, the NA count for the Date column is included in the output, but not otherwise.
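
The length difference for numeric vectors is easy to check directly, since summary.default() appends an "NA's" element only when missing values are present. A small illustration follows; the variable names are arbitrary, and the list 'z' is only a stand-in for the per-column summaries that summary.data.frame() builds internally.

```r
no_na   <- summary(1:7)          # six statistics
with_na <- summary(c(1:6, NA))   # six statistics plus "NA's"

length(no_na)     # 6
length(with_na)   # 7
names(with_na)    # the last name is "NA's"

# Row count as computed in summary.data.frame(): the longest summary wins
z <- list(x1 = no_na, x2 = with_na)
max(unlist(lapply(z, NROW)))     # 7
```

This is why a single numeric column with NA's is enough to make room for the NA's row of every column.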

A possible solution to this issue may be as simple as a modification to the code in summary.data.frame() that creates 'nr' above, to take into account the presence/absence of the "NAs" attribute for Date and POSIXct class objects in the two functions:

nr <- if (nv)
          max(unlist(lapply(z, function(x) NROW(x) + !is.null(attr(x, "NAs")))))
      else 0

so that 'nr' would be 7, rather than 6, when NA's are present. Some limited testing suggests that this resolves the problem.
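
To see what the modified expression changes, the two versions of the 'nr' computation can be compared on a stand-in for the per-column summaries. This is only a sketch: 'date_like' is built by hand to resemble a Date summary (six elements plus an "NAs" attribute, with placeholder values) rather than taken from summary.data.frame() itself.

```r
# Hand-built stand-in for a Date column summary:
# six placeholder values, with the NA count stored as the "NAs" attribute
date_like <- structure(c("Min." = 1, "1st Qu." = 2, "Median" = 3,
                         "Mean" = 3, "3rd Qu." = 4, "Max." = 5),
                       NAs = 3L)
z <- list(Col1 = date_like)

# Current computation: counts only the vector elements
nr_old <- max(unlist(lapply(z, NROW)))

# Proposed computation: adds one row when an "NAs" attribute is present
nr_new <- max(unlist(lapply(z, function(x) NROW(x) + !is.null(attr(x, "NAs")))))

nr_old   # 6
nr_new   # 7
```

A summary without the attribute is unaffected, because !is.null(attr(x, "NAs")) then contributes 0.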

However, I would acknowledge that there may be other considerations, given some of the inter-dependencies across these summary methods.

Thanks.

Marc
Comment 1 Martin Maechler 2016-02-29 14:31:02 UTC
Thank you, Marc.
Your diagnosis and proposed fix seem very reasonable and appropriate, at least from just reading it.

I'll look into checking the fix, with the goal that it can even make R 3.2.4.
Comment 2 Marc Schwartz 2016-02-29 18:35:49 UTC
Hi Martin,

Thanks for your follow up on this.

If the proposed fix is deemed to be correct, I can certainly generate a patch for you against the requisite code base(s), if it would help.

Let me know.

Regards,

Marc
Comment 3 Martin Maechler 2016-02-29 21:51:09 UTC
Committed fix as r70244 several hours ago. Will be ported to R-patched later.