Custom compression for huge data sets
9 21 Feb 2016 09:32 by u/fvhy
If I have a bunch of highly regular data that is always accessed sequentially, do you think it's possible to write a custom compression algorithm with the same ratio as bzip2 but much faster reads?
The data is equal-length records, each containing a bunch of ints and doubles, and usually there are only small deltas between nearby records.
bzip2 works great, but when analyzing the data the speed bottleneck is decompression.
I'm thinking of implementing something like this: generate frequency counts for the deltas between corresponding fields of subsequent records, then pre-build a static Huffman table for each field's deltas. Also generate a bitmask of which fields changed, and build a static Huffman table for that as well. Then encode the data as a baseline record, the number of compressed records, and, for each compressed record, the Huffman-encoded mask of changed fields followed by a Huffman-encoded delta for each changed field.
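Roughly what I mean, as a rough C++ sketch (field names here are just placeholders, and the Huffman pass over the masks and deltas isn't shown):

```cpp
#include <cstdint>
#include <vector>

// Placeholder record layout; the real field list comes from the actual data.
struct Record {
    int32_t a, b, c;
    double  x, y;
};

// One encoded record: which fields changed, plus the deltas for those fields.
struct DeltaRecord {
    uint8_t              changed_mask;  // bit i set -> field i differs from the previous record
    std::vector<int64_t> int_deltas;    // deltas for the changed int fields, in field order
    std::vector<double>  dbl_deltas;    // deltas for the changed double fields, in field order
};

// Compute mask + deltas relative to the previous record; Huffman coding of the
// mask and of each per-field delta stream would happen in a later pass.
DeltaRecord diff(const Record& prev, const Record& cur) {
    DeltaRecord d{0, {}, {}};
    auto int_field = [&](int bit, int32_t p, int32_t c) {
        if (c != p) { d.changed_mask |= 1u << bit; d.int_deltas.push_back(int64_t(c) - p); }
    };
    auto dbl_field = [&](int bit, double p, double c) {
        if (c != p) { d.changed_mask |= 1u << bit; d.dbl_deltas.push_back(c - p); }
    };
    int_field(0, prev.a, cur.a);
    int_field(1, prev.b, cur.b);
    int_field(2, prev.c, cur.c);
    dbl_field(3, prev.x, cur.x);
    dbl_field(4, prev.y, cur.y);
    return d;
}
```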
Does this sound workable? Is there a good/fast C++ library for Huffman encoding, or should I roll my own?
6 comments
4 u/0x7a69 21 Feb 2016 22:31
LZ4 compression is exactly what you want.
0 u/1HepCat 22 Feb 2016 07:10
Agreed. LZ4 is the reigning champ when you need high I/O throughput (though other algorithms achieve higher compression ratios). Snappy is another codec to consider: it's almost as fast and is sometimes available or compatible where LZ4 isn't.
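For reference, a minimal round trip through LZ4's one-shot C API looks roughly like this (sketch; error handling kept to a minimum):

```cpp
#include <lz4.h>

#include <stdexcept>
#include <string>
#include <vector>

// Compress and then decompress a buffer with LZ4's one-shot functions.
std::string lz4_roundtrip(const std::string& input) {
    const int src_size = static_cast<int>(input.size());
    std::vector<char> compressed(LZ4_compressBound(src_size));
    const int csize = LZ4_compress_default(input.data(), compressed.data(),
                                           src_size, static_cast<int>(compressed.size()));
    if (csize <= 0) throw std::runtime_error("LZ4 compression failed");

    std::string restored(input.size(), '\0');   // the caller must know the original size
    const int dsize = LZ4_decompress_safe(compressed.data(), &restored[0],
                                          csize, static_cast<int>(restored.size()));
    if (dsize < 0) throw std::runtime_error("LZ4 decompression failed");
    restored.resize(dsize);
    return restored;
}
```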
Another general-purpose tool to look into would be the Apache Parquet format. It uses run-length encoding, bit packing and record shredding to produce a very compact representation of the data before it is compressed. It does look like there's a C++ library (https://github.com/apache/parquet-cpp), although I don't know how robust it is compared to the Java implementation. Here's an interesting article about it: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
0 u/FSB 22 Feb 2016 10:45
Probably LZ4HC, which is a lot slower at compressing than regular LZ4 (it still usually beats gzip) but decompresses even faster than regular LZ4! It's just a better compressor for the LZ4 format, so the decompression code is exactly the same.
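A sketch of what that looks like with the C API (compression from lz4hc.h; decompression is the same LZ4_decompress_safe call you'd use for regular LZ4):

```cpp
#include <lz4.h>
#include <lz4hc.h>

#include <stdexcept>
#include <vector>

// Compress with LZ4HC; the output can be read back with plain LZ4_decompress_safe.
// Levels typically run from 1 up to 12; higher is slower but compresses better.
std::vector<char> compress_hc(const std::vector<char>& input, int level = 9) {
    const int src_size = static_cast<int>(input.size());
    std::vector<char> out(LZ4_compressBound(src_size));
    const int n = LZ4_compress_HC(input.data(), out.data(), src_size,
                                  static_cast<int>(out.size()), level);
    if (n <= 0) throw std::runtime_error("LZ4HC compression failed");
    out.resize(n);
    return out;
}
```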
1 u/OlympicWalrus 21 Feb 2016 21:33
Your idea makes sense, and you should be able to create a format that is both faster and smaller, since you're taking advantage of knowing exactly what the data looks like.
It would be a fun pet project. In a production system, though, you would be better served by parallelizing the problem than by creating a one-off compression format.
0 u/Craxic 22 Feb 2016 07:33
LZ4 definitely sounds like what you want, but FWIW, check out Brotli too.
0 u/dchem 22 Feb 2016 20:31
Is the data something that can be put into a table?
If so, have you looked into a columnar database called FastBit?
It uses bitmap indexing with word-aligned hybrid (WAH) compression to get a good space-performance trade-off.
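To give a rough idea of the word-aligned hybrid part (a toy sketch of the WAH idea, not FastBit's actual API):

```cpp
#include <cstdint>
#include <vector>

// Toy WAH-style encoder. Input: a bitmap already split into 31-bit groups
// (each uint32_t holds values 0..0x7FFFFFFF). Output words are either
//   literal: MSB = 0, low 31 bits are the raw group, or
//   fill:    MSB = 1, bit 30 is the fill value, low 30 bits count identical groups.
std::vector<uint32_t> wah_encode(const std::vector<uint32_t>& groups) {
    const uint32_t ALL_ONES = 0x7FFFFFFFu;
    std::vector<uint32_t> out;
    for (uint32_t g : groups) {
        const bool zero_fill = (g == 0), one_fill = (g == ALL_ONES);
        if (zero_fill || one_fill) {
            const uint32_t kind = 0x80000000u | (one_fill ? 0x40000000u : 0u);
            if (!out.empty() && (out.back() & 0xC0000000u) == kind &&
                (out.back() & 0x3FFFFFFFu) < 0x3FFFFFFFu) {
                ++out.back();                 // extend the current run of identical groups
            } else {
                out.push_back(kind | 1u);     // start a new fill word with count 1
            }
        } else {
            out.push_back(g);                 // mixed group, stored as a literal word
        }
    }
    return out;
}
```

Long runs of all-zero or all-one groups collapse into single words, which is why sparse bitmap indexes stay small while still being cheap to scan.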