Bug 16427 - Errors/Warnings using seek on a gz connection
Status: UNCONFIRMED
Alias: None
Product: R
Classification: Unclassified
Component: I/O
Version: R 3.2.0
Hardware: Other Linux
Importance: P5 enhancement
Assignee: R-core
Reported: 2015-06-15 16:02 UTC by Barry Rowlingson
Modified: 2015-11-18 16:32 UTC
Description Barry Rowlingson 2015-06-15 16:02:19 UTC
Reproducible example (on a system with gzip in PATH):

set.seed(123)
m=data.frame(z=runif(10000),x=rnorm(10000))
write.csv(m,"m.csv")
system("gzip m.csv")

That creates a file about 195k long. Open it and see if it's seekable:

gzf=gzfile("m.csv.gz")
open(gzf,"rb")
isSeekable(gzf)

I see [1] TRUE.

Now, read 20 chars, seek back, read 20 chars again - we get the same 20 chars:

> readChar(gzf,nchar=20)
[1] "\"\",\"z\",\"x\"\n\"1\",0.287"
> seek(gzf,0)
[1] 20
> readChar(gzf,nchar=20)
[1] "\"\",\"z\",\"x\"\n\"1\",0.287"

Now let's seek to 1000 in 100-byte steps:

> for(i in seq(100,1000,by=100)){seek(gzf,i)}
> readChar(gzf, nchar=20)
[1] "9115822,-0.187725488"

Now jump back to 1000 after reading those 20 chars:

> seek(gzf, 1000)
[1] 1020
Warning messages:
1: In seek.connection(gzf, 1000) : invalid or incomplete compressed data
2: In seek.connection(gzf, 1000) :
  seek on a gzfile connection returned an internal error

It's only a warning; has anything bad happened? Let's see:

> readChar(gzf, nchar=20)
[1] ""
There were 20 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In readChar(gzf, nchar = 20) : invalid or incomplete compressed data
2: In readChar(gzf, nchar = 20) : invalid or incomplete compressed data
[etc]

This seems serious enough to merit more than a warning.
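
Until seek itself reports this as an error, a caller can promote the warning locally. The following is a minimal sketch, not R's own behaviour: warnings_as_errors is a hypothetical helper name, and it simply stops at the first warning signalled while evaluating its argument.

```r
# Hypothetical helper, not part of base R: evaluate an expression and
# turn the first warning it signals into an error.
warnings_as_errors <- function(expr) {
  tryCatch(
    expr,  # the promise is forced here, inside the handler's scope
    warning = function(w) stop(conditionMessage(w), call. = FALSE)
  )
}

# Intended use with the connection from this report (assumes gzf is open):
# warnings_as_errors(seek(gzf, 1000))
```

With this wrapper, the failed seek above would stop the script instead of leaving a connection that silently returns empty strings.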

This also occurs if you try seeking straight to 230 after opening. 229 is okay, 230 isn't:

> gzf=gzfile("m.csv.gz")
> open(gzf,"rb")
> seek(gzf,229)
[1] 0
> gzf=gzfile("m.csv.gz")
> open(gzf,"rb")
> seek(gzf,230)
[1] 0
Warning messages:
1: In seek.connection(gzf, 230) : invalid or incomplete compressed data
2: In seek.connection(gzf, 230) :
  seek on a gzfile connection returned an internal error
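
To locate the boundary without retyping the three lines each time, the same probe can be looped over candidate offsets. This is only a sketch: first_bad_seek is a hypothetical helper name, and it assumes that a failing seek always shows up as a warning or an error on a freshly opened connection.

```r
# Hypothetical helper: return the first offset in 'offsets' at which seek()
# on a freshly opened gzfile connection signals a warning or an error,
# or NA if every offset seeks cleanly.
first_bad_seek <- function(path, offsets) {
  for (where in offsets) {
    con <- gzfile(path, "rb")  # fresh connection for every probe
    ok <- tryCatch(
      { seek(con, where); TRUE },
      warning = function(w) FALSE,
      error = function(e) FALSE
    )
    close(con)
    if (!ok) return(where)
  }
  NA_integer_
}

# Intended use with the file from this report:
# first_bad_seek("m.csv.gz", 0:1000)
```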

(Similar code in Python using its gz library works fine)
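
A possible workaround in R, which is not part of this report's testing, is to avoid seek altogether: reopen the file and read and discard bytes up to the wanted offset. The sketch below assumes that sequential reads on a gzfile connection are reliable; reopen_at and its chunk argument are hypothetical names.

```r
# Hypothetical helper: emulate an absolute seek on a gzip file by opening a
# new connection and discarding the first 'where' uncompressed bytes.
# The caller is responsible for closing the returned connection.
reopen_at <- function(path, where, chunk = 65536L) {
  con <- gzfile(path, "rb")
  remaining <- where
  while (remaining > 0) {
    got <- readBin(con, what = "raw", n = min(chunk, remaining))
    if (length(got) == 0L) break  # end of data reached before 'where'
    remaining <- remaining - length(got)
  }
  con
}

# Intended use with the file from this report:
# gzf <- reopen_at("m.csv.gz", 1000)
# readChar(gzf, nchar = 20)
# close(gzf)
```

This costs a full decompression from the start of the file on every call, so it only suits small files or infrequent jumps.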

?seek is quite scathing about seek on Windows platforms - even on non-gz connections. But this is Linux anyway:

> version
               _                           
platform       x86_64-unknown-linux-gnu    
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          2.0                         
year           2015                        
month          04                          
day            16                          
svn rev        68180                       
language       R                           
version.string R version 3.2.0 (2015-04-16)
nickname       Full of Ingredients         

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS

locale:
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8    
 [5] LC_MONETARY=en_GB.utf8    LC_MESSAGES=en_GB.utf8   
 [7] LC_PAPER=en_GB.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.2.0
Comment 1 Barry Rowlingson 2015-06-16 20:36:02 UTC
I've managed to hack a version of R 3.2.0 where connections.c does not use the supplied gzio.h and includes my system zlib.h (version 1.2.8 of zlib). Some changes of names of things in connections.c (and devPS.c) were needed because of an R_ prefix to subroutines in gzio.h. Also, had to pull out some #defines from gzio.c to get connections.c to compile. As I say, a hack.

With this build, my gz file seems to read okay when I seek as described in the bug report. My hypothesis is then that the gz code in gzio.h, based on zlib 1.2.5 as it is, could be buggy.

zlib had quite a refactoring after 1.2.5 so there's no easy drop-in replacement for gzio.h, and I don't have either the skills or the understanding to know why R has gzio.h anyway, and doesn't just link to zlib. I'm sure there's a good reason.