2013年11月18日星期一

MIGRATORY COMPRESSION

EMC

Coarse-grained Data Reordering to IMprove COmpressbility

compression get limited by its window size, larger window size leads to lsower compression

MC: first to coarse-grained reorganization to group similar blocks, so that you can improve compressibility with the same window size

Gzip: HUffman coding 64KB window
Bzip2: 900KB window, Burrows-WHeelses Transformation abd Huffman coding
7z: Markov chain-based range codec, 1GB window
(However, no apple to apple comprison as the don't know how to increase the window size of Gzip etc...)

Almost reverse linear relation between throughput and compression factor

Idea: improve compression by grouping similar blocks

Burrows-Wheelers: reorder within block
MC: reorder the blocks

Usecase:
1. compress a single, large file
2. Archival tier of deduplicating storage system
3. network transfers (maybe?)

Delta compression: just store how to transfor from A to A' in order to store A'. OK for single file, too complex for a deduplicating storage system, not good for locality neighter

Workflow of MC:
to compress:
segmentation (partition into blocks, calculate super fingerprints)----> similarity detector   -----> re-organizer
to un-compress:
header to specify how to re-construct file

Similarity Detection: similar to "reduendancy eleimation at block level", caluclating features by some set of rolling hashes.

Ororganization: reorganize input file, according to a recipe
1. in-memory, read in entire file
2. chunk-level: random access
3. multi-pass: in general better than chunk level

Evaluation:
At least for an in-memory ssystem, adding MC before compression never does worse than vanilla copression. If things don't fit in memory, not that much because of random accesses incurred. SSD helps though. --> Remzi: can you just limit the re-ordering scope so that it fits into memory?

Uncompression has 30% to 70% overhead in an archiveral system though, due to random access


Q: How about workload specific compression? As if you know the workload better, you may be able to compress better? MC is a pretty generic technique though?
A. I need to think about it.
     But NetAPP's vedio compression falls into that..














没有评论:

发表评论