A few days ago I had to decompress some gzip-compressed content using Rust. I quickly found the flate2 crate, which is well maintained and very widely used. I decided it would be a great fit for my little project: a CLI tool that downloads some content from AWS S3, decompresses it, and stores it in a local SQLite3 database.
I wrote my small CLI tool, and it seemed to work perfectly on the first run. All was well until, some time later, I noticed that some lines of my compressed file were missing. The code ran fine and produced no warnings. After a few hours of debugging, I was sure there was a bug somewhere in my code. I went on to write a bunch more tests to try to isolate the problem in a reproducible way, without any luck. All tests passed, with no indication of a problem anywhere.
At some point, I decided to use a sample from the original compressed files, and that is how I managed to reproduce the bug. The sample code in the README of the flate2 crate was what I used in my code:
    use std::io::prelude::*;
    use flate2::read::GzDecoder;

    fn main() {
        // "..." stands in for the gzip-compressed bytes
        let mut d = GzDecoder::new("...".as_bytes());
        let mut s = String::new();
        d.read_to_string(&mut s).unwrap();
        println!("{}", s);
    }
Unfortunately, GzDecoder could not fully decompress my files. I then went to the official library repository and found this issue, which describes exactly my problem. I had to use MultiGzDecoder instead. After this change everything worked as expected and my files decompressed successfully. The example in the README of flate2 should probably be this:
    use std::io::prelude::*;
    use flate2::read::MultiGzDecoder;

    fn main() {
        // "..." stands in for the gzip-compressed bytes
        let mut d = MultiGzDecoder::new("...".as_bytes());
        let mut s = String::new();
        d.read_to_string(&mut s).unwrap();
        println!("{}", s);
    }
Finally, I still don’t know why MultiGzDecoder works when GzDecoder doesn’t. If you know, I will update this blog with your answer.
MultiGzDecoder works and the other does not because a gzip file may consist of several gzip members concatenated together. Such a file is still compliant and should decompress to the concatenation of all its members. Each member carries its own header and its own trailer (a CRC-32 and the uncompressed size). GzDecoder decodes a single member and stops at its end, while MultiGzDecoder keeps going and decodes every member that follows. That is why they have implemented MultiGzDecoder.
In your case the file most likely contains more than one member, so GzDecoder terminates after the first one and the remaining lines go missing without any error.
I know this because in bioinformatics block zipping (bgzip) is very common due to performance gains.
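To make the idea of concatenated members concrete, here is a toy sketch in plain Rust, with no dependency on flate2. It is not how flate2 is implemented. It assumes every member holds exactly one "stored" (uncompressed) DEFLATE block with no optional header fields, and the helper names (stored_member, decode_member, decode_first, decode_all) are made up for illustration. It only shows why a decoder that stops after the first member silently drops the rest of the file.

```rust
/// Bitwise CRC-32 (IEEE polynomial), the checksum kept in the gzip trailer.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn le_u16(b: &[u8]) -> u16 {
    u16::from_le_bytes([b[0], b[1]])
}

fn le_u32(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Hypothetical helper: wrap `data` in one gzip member made of a single
/// stored DEFLATE block (so `data` must fit in 65535 bytes).
fn stored_member(data: &[u8]) -> Vec<u8> {
    assert!(data.len() <= 0xFFFF);
    let len = data.len() as u16;
    // Header: magic, CM = 8 (deflate), no flags, mtime 0, XFL 0, OS unknown.
    let mut out = vec![0x1f, 0x8b, 8, 0, 0, 0, 0, 0, 0, 255];
    out.push(0x01); // BFINAL = 1, BTYPE = 00 (stored)
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(&(!len).to_le_bytes());
    out.extend_from_slice(data);
    // Trailer: CRC-32 and size of the uncompressed data.
    out.extend_from_slice(&crc32(data).to_le_bytes());
    out.extend_from_slice(&(data.len() as u32).to_le_bytes());
    out
}

/// Decode one member of the restricted form above.
/// Returns the payload and the number of input bytes consumed.
fn decode_member(input: &[u8]) -> Option<(Vec<u8>, usize)> {
    if input.len() < 23 || !input.starts_with(&[0x1f, 0x8b, 8, 0]) || input[10] != 0x01 {
        return None;
    }
    let len = le_u16(&input[11..13]);
    if le_u16(&input[13..15]) != !len {
        return None;
    }
    let end = 15 + len as usize;
    if input.len() < end + 8 {
        return None;
    }
    let data = &input[15..end];
    let crc = le_u32(&input[end..end + 4]);
    let size = le_u32(&input[end + 4..end + 8]);
    if crc != crc32(data) || size as usize != data.len() {
        return None;
    }
    Some((data.to_vec(), end + 8))
}

/// Stops after the first member and ignores whatever follows.
fn decode_first(input: &[u8]) -> Option<Vec<u8>> {
    decode_member(input).map(|(data, _)| data)
}

/// Keeps decoding members until the input is exhausted.
fn decode_all(mut input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    while !input.is_empty() {
        let (data, used) = decode_member(input)?;
        out.extend_from_slice(&data);
        input = &input[used..];
    }
    Some(out)
}

fn main() {
    // Two members concatenated, like `cat a.gz b.gz > both.gz`.
    let mut file = stored_member(b"line 1\n");
    file.extend(stored_member(b"line 2\n"));

    let first = decode_first(&file).expect("valid first member");
    let all = decode_all(&file).expect("valid members");
    println!("first member only: {:?}", String::from_utf8_lossy(&first));
    println!("all members:       {:?}", String::from_utf8_lossy(&all));
}
```

Note that decode_first reports no error at all: it returns a perfectly valid result that is simply incomplete, which matches the missing-lines symptom described in the post.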