Bug 15318 - parallel compression
Summary: parallel compression
Alias: None
Product: R
Classification: Unclassified
Component: Wishlist
Version: R 3.0.0
Hardware: All Linux
Importance: P5 enhancement
Assignee: R-core
Depends on:
Reported: 2013-05-22 17:31 UTC by Patrick McCann
Modified: 2013-05-22 20:34 UTC
Description Patrick McCann 2013-05-22 17:31:28 UTC
It would be really nice if the save and load functions supported parallel compression via the pbzip2 and pigz tools. Compressing in parallel can dramatically speed up a workflow that saves large objects.
Comment 1 Brian Ripley 2013-05-22 19:06:25 UTC
Please supply evidence of a real problem in which this is the case.
Comment 2 Patrick McCann 2013-05-22 20:34:03 UTC
(In reply to comment #1)
> Please supply evidence of a real problem in which this is the case.

Using bzip2 I get this timing:

> a<-matrix(rnorm(1e8),1e4,1e4)
> print(object.size(a), units='auto')
762.9 Mb
> system.time(save(a, file='car.RData', compress='bzip2'))
   user  system elapsed 
165.892   2.112 169.956 

On my EC2 instance, which has a fairly fast disk, I get this when using pbzip2:

> begin<-Sys.time()
> save(a, file='car3.RData')
> system('pbzip2 car3.RData')
> Sys.time()-begin
Time difference of 49.23157 secs

I believe the pbzip2 workflow includes an additional write to disk, so it would be even faster if we could do something like:

save(a, file='car.RData', compress='pbzip2')
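
Until something like that exists, the extra write might be avoided by streaming through a pipe. The following is only a sketch under stated assumptions: save() accepts a connection as its file argument (and applies no compression of its own in that case), so the serialized bytes can be sent straight to an external compressor. The names save_piped, load_piped and find_bzip2 are hypothetical helpers, not part of R, and the sketch assumes a Unix-like shell with pbzip2 (or, as a fallback, serial bzip2) on the PATH.

```r
# Hypothetical helper: locate pbzip2, falling back to serial bzip2.
find_bzip2 <- function() {
  tool <- Sys.which(c("pbzip2", "bzip2"))
  tool <- tool[nzchar(tool)][1]
  if (is.na(tool)) stop("neither pbzip2 nor bzip2 found on PATH")
  unname(tool)
}

# Hypothetical helper: save the named objects, compressing on the fly.
# 'names' is a character vector of object names looked up in 'envir'.
save_piped <- function(names, file, envir = parent.frame()) {
  cmd <- paste(shQuote(find_bzip2()), "-c >", shQuote(file))
  con <- pipe(cmd, open = "wb")
  on.exit(close(con))  # closing the pipe waits for the compressor to finish
  save(list = names, file = con, envir = envir)
  invisible(file)
}

# Hypothetical helper: load objects back, decompressing on the fly.
load_piped <- function(file, envir = parent.frame()) {
  cmd <- paste(shQuote(find_bzip2()), "-dc", shQuote(file))
  con <- pipe(cmd, open = "rb")
  on.exit(close(con))
  load(con, envir = envir)
}
```

With these in place, save_piped("a", "car.RData.bz2") would stand in for the two-step save plus pbzip2 workflow. Whether it is actually faster than the 49 seconds reported above has not been measured here.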