With certain files, reading from a bzfile connection apparently stops prematurely and returns truncated data. For those files, this can be reproduced with the following code:
# --> 889 This is too short!
# --> 5111 This is correct.
The same problem occurs with read.table. Am I doing something wrong or is this a bug?
Using "pipe" seems a valid workaround but is probably less efficient.
A file to reproduce this is available here:
$ uname -a
Linux ala 2.6.32-bpo.5-amd64 #1 SMP Sat Sep 18 19:03:14 UTC 2010 x86_64 GNU/Linux
R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
 LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
 LC_PAPER=en_US.UTF-8 LC_NAME=C
 LC_ADDRESS=C LC_TELEPHONE=C
 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
 stats graphics grDevices utils datasets methods base
Is this file a single compressed stream? See the comments about this in ?file.
The file is a compressed file on disk (see http link).
The problem also persists if I explicitly open the file in "rb" mode.
The file must be corrupt: try bunzip2 and then bzip2 on it, and it works for me.
Hmm, it has not been created by bzip2, yet that does not automatically make it corrupt as long as it conforms to the file format specifications.
I can see that decompressing and recompressing with bunzip2/bzip2 might help, as the original file was generated by another compressor with bzip2 format support (pbzip2, specifically).
But how come "bzcat" or "bunzip2 -c" have no problem processing the original file, while bzfile in R does? The original file clearly conforms to the bz2 file format specification, to the point that bunzip2 reads and processes it without any problems.
Doesn't R use the same library as bunzip2? If so, why does bunzip2 read the original file without any problems when R does not?
It seems I cannot change the ticket status from "invalid". I do not think that the functionality limitation / bug has been addressed.
Well, you have not proven that the file does adhere to the specs, you don't even know what the file really contains! [Technically, it would not really matter since we don't claim to support anything else but bzip2 output...]
The file is really a concatenation of multiple streams, each 900000 bytes long. Since end-of-stream is signaled after the first 900000 bytes, that is all you get. bzcat decompresses every stream in its input and concatenates the results, which is why you get the longer result.
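The stream-boundary behavior described here is not specific to R and can be illustrated with any bzip2 binding. The following is a minimal sketch in Python (not R), using only the standard `bz2` module. The sample data and variable names are invented for illustration, and the sketch makes no claim about how R's bzfile is implemented internally.

```python
import bz2

# Two independently compressed bzip2 streams, concatenated byte for byte.
# Assumption: this mimics the layout a parallel compressor such as pbzip2 emits.
first = b"line from stream 1\n" * 3
second = b"line from stream 2\n" * 5
multi_stream = bz2.compress(first) + bz2.compress(second)

# A single decompressor object stops at the first end-of-stream marker;
# everything after that marker is left untouched in `unused_data`.
single = bz2.BZ2Decompressor()
truncated = single.decompress(multi_stream)

# bz2.decompress() processes all concatenated streams, much like bzcat.
complete = bz2.decompress(multi_stream)

print(len(truncated), len(complete))
print(single.eof, len(single.unused_data) > 0)
```

A reader that stops at the first end-of-stream marker sees only `first`, which matches the shorter line count reported above, while a reader that continues with the leftover bytes recovers both parts.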
Hmm, fair enough -- point taken.
I had thought that bzfile(infile) would/should be functionally equivalent to pipe(paste('bzcat',infile)).
I see that this does not have to be true if there are multiple streams in a file. If it were easy to add support for files compressed in parallel by pbzip2 (by essentially doing what bzcat does), I think that would be very helpful.
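To make "doing what bzcat does" concrete, here is a rough sketch of the loop involved, written in Python with the standard `bz2` module rather than in R or C. The helper name `decompress_all_streams` is hypothetical; it is neither bzcat's actual source nor a proposed patch for R, only an illustration of restarting decompression at each stream boundary.

```python
import bz2


def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every bzip2 stream in `data` and join the results.

    Hypothetical helper for illustration only.
    """
    pieces = []
    while data:
        # Each stream needs a fresh decompressor: once one has reached
        # end-of-stream it cannot accept further input.
        decomp = bz2.BZ2Decompressor()
        pieces.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        # Bytes after the end-of-stream marker belong to the next stream.
        data = decomp.unused_data
    return b"".join(pieces)


# Usage: three separately compressed chunks, concatenated into one blob.
chunks = [b"a" * 10, b"b" * 20, b"c" * 30]
blob = b"".join(bz2.compress(c) for c in chunks)
restored = decompress_all_streams(blob)
print(len(restored))
```

The key design point is that end-of-stream is treated as "try the next stream" rather than "end of file", and the loop stops only when no input bytes remain.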