Bug 14479 - Read from bzfile wrongly truncates?
Status: CLOSED INVALID
Product: R
Classification: Unclassified
Component: I/O
Version: R 2.12.1
Hardware: x86_64/x64/amd64 (64-bit) Linux-Debian
Importance: P5 major
Assigned To: R-core
URL: http://ala.boku.ac.at:4080/kreil/scra...
Reported: 2011-01-18 23:24 UTC by rbugs09
Modified: 2014-02-16 11:43 UTC (History)
Description rbugs09 2011-01-18 23:24:34 UTC
With certain files, reading from a bzfile connection stops prematurely and returns truncated data. For the file linked below, this can be reproduced with the following code:

infileG <- 'chipcomp_v4_raw_sample_1_G.dat.bz2'

inf <- bzfile(infileG)
ln <- readLines(inf)
length(ln)
# --> 889  This is too short!

inf <- pipe(paste('bzcat', infileG))
ln <- readLines(inf)
length(ln)
# --> 5111  This is correct.

The same problem occurs with read.table. Am I doing something wrong or is this a bug?

Using "pipe" seems a valid workaround but is probably less efficient.
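
For illustration, the pipe workaround can be wrapped in a small helper. This is only a sketch: the name read_bz2_lines and its tool argument are hypothetical, and it assumes a bzcat executable is available on the PATH. Quoting the file name with shQuote() keeps paths containing spaces from breaking the shell command.

```r
# Hypothetical helper: read all lines of a .bz2 file by piping it
# through an external decompressor (assumed to be bzcat on the PATH).
read_bz2_lines <- function(path, tool = "bzcat") {
  if (!nzchar(Sys.which(tool))) {
    stop("'", tool, "' was not found on the PATH")
  }
  # shQuote() protects file names that contain spaces or shell metacharacters.
  con <- pipe(paste(tool, shQuote(path)))
  on.exit(close(con))   # release the connection even if readLines() fails
  readLines(con)
}

# Usage (file name taken from the report above):
# ln <- read_bz2_lines('chipcomp_v4_raw_sample_1_G.dat.bz2')
# length(ln)
```

Because the decompression happens in a separate process, the result does not depend on how bzfile handles the file.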


A file to reproduce this is available here:
    http://ala.boku.ac.at:4080/kreil/scratch/tmp/Rbug/

Best regards,
David


$ uname -a
Linux ala 2.6.32-bpo.5-amd64 #1 SMP Sat Sep 18 19:03:14 UTC 2010 x86_64 GNU/Linux

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
Comment 1 Brian Ripley 2011-01-19 12:45:56 UTC
Is this file a single compressed stream?  See the comments about this in ?file.
Comment 2 rbugs09 2011-01-19 12:55:46 UTC
The file is a compressed file on disk (see http link).

The problem also persists if I explicitly open the file in "rb" mode.

Best regards,
David
Comment 3 Brian Ripley 2011-01-24 21:02:35 UTC
The file must be corrupt: if I run bunzip2 and then bzip2 on it, the recompressed file works for me.
Comment 4 rbugs09 2011-01-24 21:33:04 UTC
Hmm, it has not been created by bzip2, yet that does not automatically make it corrupt as long as it conforms to the file format specifications.

I can see that decompressing and recompressing with bunzip2/bzip2 might help, as the original file was generated by another compressor with bzip2 format support (pbzip2, specifically).

But how come "bzcat" or "bunzip2 -c" have no problem processing the original file while bzfile in R does? The original file clearly conforms to the bz2 file format specifications, to the point that bunzip2 will read and process it with no problems.

Doesn't R use the same library as bunzip2? If so, why does bunzip2 read the original file without any problems when R does not?

It seems I cannot change the ticket status from "invalid". I do not think that the functionality limitation / bug has been addressed.

Best regards,
David
Comment 5 Simon Urbanek 2011-01-24 23:26:43 UTC
Well, you have not proven that the file adheres to the specs; you don't even know what the file really contains! [Technically, it would not really matter, since we don't claim to support anything but bzip2 output...]

The file is really a concatenation of multiple streams, each being 900000 bytes long. Since end-of-stream is signaled at the end of the first 900000 bytes, that is all you get. bzcat decompresses every stream in its input and concatenates the results, which is why you get the longer result.
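
The multi-stream situation can be reproduced on a small scale. The sketch below is an illustration, not the reporter's file: it compresses two pieces of text separately with memCompress(), writes them back to back into one scratch file (on the assumption that this mimics what a parallel compressor produces), and compares the line count seen through bzfile() with the count obtained by decompressing each stream on its own. All variable names are made up for the example.

```r
# Two independent bzip2 streams, compressed separately.
part1 <- paste0("line ", 1:3, "\n", collapse = "")
part2 <- paste0("line ", 4:5, "\n", collapse = "")
s1 <- memCompress(charToRaw(part1), type = "bzip2")
s2 <- memCompress(charToRaw(part2), type = "bzip2")

# Write the streams back to back into one scratch file.
tmp <- tempfile(fileext = ".bz2")
writeBin(c(s1, s2), tmp)

# Every bzip2 stream begins with the magic bytes "BZh".
magic1 <- rawToChar(s1[1:3])
magic2 <- rawToChar(s2[1:3])

# Decompress each stream separately and join the text, which is
# the result a bzcat-style reader would produce.
full <- paste0(rawToChar(memDecompress(s1, type = "bzip2")),
               rawToChar(memDecompress(s2, type = "bzip2")))
n_expected <- length(strsplit(full, "\n", fixed = TRUE)[[1]])

# Read the same file through bzfile() and compare.
con <- bzfile(tmp)
n_bzfile <- length(readLines(con))
close(con)

cat("lines in all streams:", n_expected, "\n")
cat("lines seen by bzfile:", n_bzfile, "\n")
```

If bzfile() stops at the first end-of-stream marker, as described in this comment, the second count covers only the first stream; whether that happens may depend on the R version in use.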
Comment 6 rbugs09 2011-01-24 23:51:13 UTC
Hmm, fair enough -- point taken.

I had thought that bzfile(infile) would/should be functionally equivalent to pipe(paste('bzcat',infile)).

I see that this does not have to be true if there are multiple streams in a file. If it were easy to add support for files compressed on multiple cores by pbzip2 (by essentially doing what bzcat does), I think that would be very helpful.

Best regards,
David.
Comment 7 Jackie Rosen 2014-02-16 11:43:18 UTC
(spam comment removed)