
MSST'16, Session 3: Store More, Longer, and for Less: Deduplication and Archival Systems

A Long Term User-Centric Analysis of Deduplication Patterns

studies a dataset covering 21 months, with 1 snapshot per user per day

many small files (< 1 MB), but a few large files consume most of the space
in general, small files achieve a higher deduplication ratio than large files

per-user deduplication ratio and redundancy across users both vary a lot
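
To make the per-user vs. across-user comparison concrete, here is a minimal sketch. It assumes the common definition of deduplication ratio (logical bytes divided by unique bytes stored); the talk may define it differently. The `snapshots` data and the `dedup_ratio` helper are hypothetical, invented for illustration only.

```python
def dedup_ratio(chunks):
    """Return logical bytes / unique bytes for (fingerprint, size) pairs.

    Assumption: ratio = logical size / stored size after deduplication.
    """
    logical = 0
    unique = {}  # fingerprint -> size of the first chunk seen with it
    for fingerprint, size in chunks:
        logical += size
        unique.setdefault(fingerprint, size)
    stored = sum(unique.values())
    return logical / stored if stored else 1.0


# Hypothetical toy data: user -> list of (fingerprint, size in bytes).
snapshots = {
    "alice": [("a", 4), ("a", 4), ("b", 4), ("c", 4)],
    "bob": [("a", 4), ("d", 4)],
}

# Ratio when each user is deduplicated in isolation.
per_user = {user: dedup_ratio(chunks) for user, chunks in snapshots.items()}

# Ratio when all users share one chunk store (captures cross-user redundancy).
all_chunks = [c for chunks in snapshots.values() for c in chunks]
global_ratio = dedup_ratio(all_chunks)
```

In this toy data bob has no redundancy on his own, but one of his chunks is also held by alice, so the global ratio is higher than bob's individual ratio.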

Lazy Exact Deduplication

postpone disk lookups (fingerprint lookups) until we can do them in a batch
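
A minimal sketch of the batching idea, not the paper's actual design: fingerprints are buffered in memory and checked against the index only when the buffer fills. The `LazyDedup` class, its `batch_size` parameter, and the in-memory `set` standing in for the on-disk index are all assumptions made for illustration.

```python
class LazyDedup:
    """Buffer fingerprints and resolve them against the index in batches."""

    def __init__(self, batch_size=4):
        self.index = set()       # stand-in for the on-disk fingerprint index
        self.pending = []        # fingerprints whose lookup is postponed
        self.batch_size = batch_size
        self.batch_lookups = 0   # number of batched passes over the index
        self.stored = []         # fingerprints of chunks actually written

    def add(self, fingerprint):
        self.pending.append(fingerprint)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Resolve all pending fingerprints in one pass."""
        if not self.pending:
            return
        self.batch_lookups += 1
        for fingerprint in self.pending:
            # The index is updated as we go, so duplicates inside the
            # same batch are also detected and the result stays exact.
            if fingerprint not in self.index:
                self.index.add(fingerprint)
                self.stored.append(fingerprint)
        self.pending.clear()


dedup = LazyDedup(batch_size=4)
for fp in ["a", "b", "a", "c", "b", "d"]:
    dedup.add(fp)
dedup.flush()  # resolve whatever is still buffered at end of stream
```

Six chunks are resolved with two batched passes instead of six individual lookups, and every duplicate is still caught.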

Sorted Deduplication: How to Process Thousands of Backup Streams

requirements are changing: a few large streams ---> many streams (e.g., cloud backup)

Effects of Prolonged Media Usage and Long-term Planning on Archival Systems

preserving data for ~100 to ~1000 years

when do you retire/replace media?
how long do you plan for?

Failure scenarios: device failures and economic failure 

1. should media be used past their manufacturer-suggested service life or warranty period?
    (for archival data, disks might last longer)

they build a model of the purchase, maintenance, and retirement phases to calculate cost
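
A rough sketch of what a three-phase cost calculation could look like. This is not the authors' model: the `archive_cost` function, its parameters, and the numbers are hypothetical, and it deliberately ignores failure rates, price trends, and discounting.

```python
import math


def archive_cost(horizon_years, service_life_years, purchase_cost,
                 annual_maintenance, retirement_cost):
    """Total cost of keeping one media unit's worth of data for the horizon.

    Assumption: media are replaced exactly at the end of their service life,
    and all costs stay constant over time.
    """
    generations = math.ceil(horizon_years / service_life_years)
    purchase = generations * purchase_cost          # purchase phase
    maintenance = horizon_years * annual_maintenance  # maintenance phase
    retirement = generations * retirement_cost      # retirement phase
    return purchase + maintenance + retirement


# Hypothetical numbers: compare replacing media every 5 vs. every 10 years
# over a 100-year planning horizon.
cost_5y = archive_cost(100, 5, purchase_cost=100,
                       annual_maintenance=10, retirement_cost=20)
cost_10y = archive_cost(100, 10, purchase_cost=100,
                        annual_maintenance=10, retirement_cost=20)
```

Under these simplified assumptions, longer use always looks cheaper; a realistic model would also have to charge for the higher failure risk of older media.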

