Bug 17333 - tapply function bug
Summary: tapply function bug
Status: CLOSED FIXED
Alias: None
Product: R
Classification: Unclassified
Component: Accuracy (show other bugs)
Version: R 3.4.1
Hardware: x86_64/x64/amd64 (64-bit) Windows 64-bit
: P5 critical
Assignee: R-core
URL:
Depends on:
Blocks:
 
Reported: 2017-08-28 20:21 UTC by Jiqiong Dai
Modified: 2017-09-19 16:25 UTC (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jiqiong Dai 2017-08-28 20:21:25 UTC
Please test the following simple calculation:

x <- cbind( "a" = 1 : 2, "b" = 1 : 2, "c" = 1 : 2 )

tapply( x[ , "c" ], INDEX = list( x[ , "a" ], x[ , "b" ] ), FUN = as.character )

The correct answer should be:



  1   2  
1 "1" NA 
2 NA  "2"


But R 3.4.1 shows that 

  1   2  
1 "1" "" 
2 ""  "2"


Which is clearly incorrect. I suspect that some very fundamental functions might be incorrect in this particular release. It might widely affect many many more functions depending on these simple calculations.
Comment 1 Suharto Anggono 2017-08-29 15:53:55 UTC
I am responsible.
In https://stat.ethz.ch/pipermail/r-devel/2017-January/073684.html , I said: "To initialize array, a zero-length vector can also be used."
In https://stat.ethz.ch/pipermail/r-devel/2017-February/073694.html , I said this.
vector(typeof(ans))
(or  vector(storage.mode(ans)))
has length zero and can be used to initialize array.

In fact, when character(0) is the initializer, the array is filled with "", not NA.

Defending myself, it is actually a bug of function 'array'. The documentation, in "Value" section, says: "If data has length zero, NA of an appropriate type is used for atomic vectors (0 for raw vectors) and NULL for lists."

vector(typeof(ans))[1L] works to initialize array.
Comment 2 Martin Maechler 2017-08-29 16:26:08 UTC
(In reply to Suharto Anggono from comment #1)
> I am responsible.
> In https://stat.ethz.ch/pipermail/r-devel/2017-January/073684.html , I said:
> "To initialize array, a zero-length vector can also be used."
> In https://stat.ethz.ch/pipermail/r-devel/2017-February/073694.html , I said
> this.
> vector(typeof(ans))
> (or  vector(storage.mode(ans)))
> has length zero and can be used to initialize array.
> 
> In fact, when character(0) is the initializer, the array is filled with "",
> not NA.
> 
> Defending myself, it is actually a bug of function 'array'. The
> documentation, in "Value" section, says: "If data has length zero, NA of an
> appropriate type is used for atomic vectors (0 for raw vectors) and NULL for
> lists."
> 
> vector(typeof(ans))[1L] works to initialize array.

Thank you, Suharto.   Yes indeed, it is odd that this is needed because of the character case.
Indeed there _is_ bug in array here,  either in its implementation or its documentation.
The responsibility for the new bug is also with me who did commit the current code to the R sources.

I will commit the above "workaround" for now (after adding a regression test and running  'make check-all' .. so probably not today anymore.

Martin
Comment 3 Martin Maechler 2017-08-29 22:17:56 UTC
(In reply to Martin Maechler from comment #2)
> (In reply to Suharto Anggono from comment #1)

> Indeed there _is_ bug in array here,  either in its implementation or its
> documentation.
> The responsibility for the new bug is also with me who did commit the
> current code to the R sources.
> 
> I will commit the above "workaround" for now (after adding a regression test
> and running  'make check-all' .. so probably not today anymore.

done that in the mean time (port to R-patched will wait 2 days or so).

I've looked a bit more closely at the array() bug.
From reading  src/main/array.c    it seems really an oversight that  
array(""[0], 1:2) does not give character NAs  (the source code has a comment that strings (and lists, exprs) *are* already initialized even when not filled with NA, 
but strings really are initialized by the equivalent of  character(n)   which does not give NA_character_  but "" {and I am *not* proposing to change that!}.

I've started to run   make check-all with the obvious change to src/main/array.c  and will see tomorrow.
Comment 4 Suharto Anggono 2017-09-19 16:13:33 UTC
(In reply to Suharto Anggono from comment #1)
> In fact, when character(0) is the initializer, the array is filled with "",
> not NA.
> 
> Defending myself, it is actually a bug of function 'array'. The
> documentation, in "Value" section, says: "If data has length zero, NA of an
> appropriate type is used for atomic vectors (0 for raw vectors) and NULL for
> lists."

History, from the source: With zero-length input, function 'array' behaves that way since R 1.9.0. It is broken for character(0) in R 2.15.2, when 'array' is implemented in C code.
Comment 5 Martin Maechler 2017-09-19 16:25:55 UTC
(In reply to Suharto Anggono from comment #4)
> (In reply to Suharto Anggono from comment #1)
> > In fact, when character(0) is the initializer, the array is filled with "",
> > not NA.
> > 
> > Defending myself, it is actually a bug of function 'array'. The
> > documentation, in "Value" section, says: "If data has length zero, NA of an
> > appropriate type is used for atomic vectors (0 for raw vectors) and NULL for
> > lists."
> 
> History, from the source: With zero-length input, function 'array' behaves
> that way since R 1.9.0. It is broken for character(0) in R 2.15.2, when
> 'array' is implemented in C code.

Yes, of course. I did agree it's a bug to be fixed (comment #c2 and #c3), and mentioned I would fix it.  Done for R-devel already end of August (when I wrote here), and then have ported it to R patched, now R 3.4.2 beta, eight days ago.

So yes, now we could (very slightly) simplify tapply() again to make use of the new array() consistency.