The whole session is from EMC (!)
Characteristics of Backup Workloads in Production Systems
Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu, EMC Corporation
My takeaway: backup systems differ from primary systems. For primary system characteristics, look at the Microsoft paper; for backup, look at this one
Motivation: backup storage systems differ from primary storage systems
Studied many systems to characterize them (compared with the Microsoft study on primary systems)
File size: mean 22 MB for primary systems
2 GB for backup systems
File counts: mean = 14K for backup (backup systems don't organize files the way humans do)
Mean = 117K for primary
Deduplication: 3x-6x for primary, 384x for backup
Impact of chunk size: (used content-defined merging to combine 1 KB fingerprints into 2 KB, 4 KB, and 16 KB chunks)
~15% better deduplication for each halving of chunk size, but 2x the metadata.
Best tradeoff: 4 KB or 8 KB.
Microsoft: for primary data, whole-file dedup works 87% as well as chunk-level dedup! (only for primary, not backup)
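To make the chunk-size tradeoff concrete, here is a minimal Python sketch of content-defined chunking with a toy running hash (my own illustration, not the paper's implementation; the hash function, minimum-length window, and power-of-two average size are all assumptions). Halving `avg_size` produces roughly twice as many chunks and fingerprints, which is the 2x metadata cost noted above.

```python
import hashlib

def chunk_boundaries(data: bytes, avg_size: int = 4096, window: int = 48):
    """Content-defined chunking: declare a boundary when the low bits of
    a running hash are zero, so expected chunk size is ~avg_size and
    boundaries survive byte insertions (unlike fixed-size chunking)."""
    mask = avg_size - 1          # avg_size must be a power of two
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF          # toy running hash
        if i - start + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # final partial chunk
    return chunks

def fingerprints(chunks):
    """Dedup index: fingerprint -> chunk; duplicate chunks collapse."""
    return {hashlib.sha1(c).hexdigest(): c for c in chunks}
```

Smaller chunks catch more duplicate regions but every chunk costs a fingerprint entry, hence the 4-8 KB sweet spot.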
Caching (replayed traces with different cache sizes):
Moral: more cache is better, but there is a turning point of diminishing returns.
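The cache-replay experiment can be sketched as an LRU simulation (hypothetical code, not the authors' tool): replay a trace of chunk fingerprints at several cache sizes and watch the hit ratio flatten once the working set fits.

```python
from collections import OrderedDict

def hit_ratio(trace, cache_size):
    """Replay a trace of chunk fingerprints through an LRU cache."""
    cache, hits = OrderedDict(), 0
    for fp in trace:
        if fp in cache:
            hits += 1
            cache.move_to_end(fp)          # mark as most recently used
        else:
            cache[fp] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)
```

Once the cache holds the trace's working set, further growth buys almost nothing, which is the turning point above.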
Q&A:
Q: Data/trace available?
A: can’t make promise. But have your students come intern with us!
Q: speed/performance based on chunk size?
A: didn’t look at it.
Q: newer data means lower deduplication.
WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression
Philip Shilane, Mark Huang, Grant Wallace, and Windsor Hsu, EMC Corporation
My takeaway: when transferring data over a WAN, reduce data aggressively. Do everything you can: dedup, compression, finding similarities using sketches, etc. And this paper basically talks about how to effectively cache sketches.
Remote backup of data (WAN is the bottleneck!)
How to optimize: dedup, local compression, stream-informed caching!
Sketches to find similar blocks, then transfer fingerprints and differences
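The "transfer differences" step is classic delta encoding against a similar base block. A rough stand-in using Python's difflib (my illustration; the paper's actual delta encoder is not this):

```python
import difflib

def delta_encode(base: bytes, target: bytes):
    """Encode target as copy/insert ops against a similar base block."""
    ops = []
    sm = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))   # reuse bytes from base
        else:
            ops.append(("insert", target[j1:j2]))  # literal new bytes
    return ops

def delta_decode(base: bytes, ops) -> bytes:
    """Rebuild target from the base block plus the delta ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)
```

Only the inserted bytes and the copy instructions cross the WAN; the receiver reconstructs the target from its local copy of the base.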
Sketch index options:
Full index (0.5 TB for 256 TB of data, per super-feature), with random I/O on that 0.5 TB! Can find all similar matches.
Partial index (maybe an LRU policy? not persistent): has to hold a full backup to be effective (so it becomes as big as the full index)
Stream-informed cache
Key insight: delta locality corresponds to dedup locality
Build sketches on the fly for the corresponding dedup chunks (depends on stream locality and cache policy)
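A plausible sketch computation (a hypothetical scheme in the spirit of super-features; the feature count, grouping, window step, and hash choices are all my assumptions): take max-hash features over the chunk under several salted hash functions, group them into super-features, and treat any super-feature match as a delta candidate.

```python
import hashlib

def features(chunk: bytes, n_features: int = 12, window: int = 32):
    """n max-hash features: for each salted hash, keep the maximum over
    (non-overlapping, for simplicity) windows of the chunk. Similar
    chunks tend to share most features."""
    feats = []
    for i in range(n_features):
        best = -1
        for pos in range(0, max(1, len(chunk) - window + 1), window):
            h = int.from_bytes(
                hashlib.blake2b(chunk[pos:pos + window],
                                digest_size=8, salt=bytes([i])).digest(),
                "big")
            best = max(best, h)
        feats.append(best)
    return feats

def super_features(feats, group: int = 4):
    """Hash each group of features into one super-feature. Two chunks
    matching on ANY super-feature become delta candidates."""
    sfs = []
    for g in range(0, len(feats), group):
        blob = b"".join(f.to_bytes(8, "big") for f in feats[g:g + group])
        sfs.append(hashlib.sha1(blob).hexdigest())
    return sfs
```

The stream-informed idea is then to cache these super-features only for chunks near the current backup stream position, instead of keeping a full on-disk index.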
Evaluation: stream-informed cache is 15% worse than the full index at finding matches.
Effective bandwidth is 100x the actual bandwidth, 2x better than without sketches.
Overhead: 20% slowdown on writes, but only for non-duplicates
Power Consumption in Enterprise-Scale Backup Storage Systems
Zhichao Li, Stony Brook University; Kevin M. Greenan and Andrew W. Leung, EMC Corporation; Erez Zadok, Stony Brook University
My takeaway: the controller is more power hungry than the disks. But no further breakdown of power consumption in this paper.
Motivation: power design important, but no measurements!
Measurements: idle power consumption 200-800 W (0.8-3 W per TB)
Deduplication saves power: saving space saves hardware, and saving I/O saves energy
Spin-down vs. power-down: spin-down saves 6.5 W per disk, while power-down saves 7.6-9.3 W per disk. (Still, 56% of power consumption comes from outside the disks!)
Conclusion: the controller is more power hungry than the disks!
System not power proportional
Disparate consumption between similar H/W
Q&A:
Q: if more disk, then save power?
A: depends on customer requirements