Wednesday, February 15, 2012

FAST'12 Session 2: Back it up!

The whole session is from EMC (!)


Characteristics of Backup Workloads in Production Systems

Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu, EMC Corporation

My takeaway: backup systems differ from primary systems. For primary-system characteristics, see the Microsoft study; for backup, see this paper.

Motivation: backup storage systems differ from primary storage systems

Studied a large number of systems to characterize them (compared with the Microsoft study of primary systems)

File size: mean 22 MB for primary systems

mean 2 GB for backup systems

File counts: mean = 14K for backup (backup systems don't organize files the way humans do)

mean = 117K for primary

Deduplication: 3x-6x for primary, 384x for backup

Impact of chunk size (used content-defined merging techniques to coalesce 1K-chunk fingerprints into 2K, 4K, and 16K chunks):

Halving the chunk size gives ~15% better deduplication, but 2x the metadata.

Sweet spot: 4K or 8K.
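To make the tradeoff concrete, here is a minimal content-defined chunking sketch (my own toy Python, not the paper's implementation; the rolling hash is a crude stand-in for a real Rabin fingerprint). Decreasing avg_bits by one halves the average chunk size, which roughly doubles the fingerprint metadata while exposing more duplicate boundaries:

import hashlib

def chunk(data, avg_bits=12, min_size=48):
    # Cut where the hash's low avg_bits are zero -> ~2**avg_bits-byte average chunks.
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF  # toy rolling hash, not a real Rabin fingerprint
        if (h & mask) == 0 and i - start + 1 >= min_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup_ratio(chunks):
    # Logical bytes divided by unique stored bytes.
    seen, stored = set(), 0
    for c in chunks:
        fp = hashlib.sha1(c).digest()
        if fp not in seen:
            seen.add(fp)
            stored += len(c)
    return sum(len(c) for c in chunks) / stored if stored else 1.0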

Microsoft: for primary data, whole-file dedup achieves 87% of chunk-level dedup! (holds for primary data only, not backup)

Caching (replayed traces with different cache sizes):

Moral: more cache is better, but there is a turning point.
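To illustrate the turning point, a toy LRU replay (mine, on a hypothetical cyclic trace, not the paper's real traces): the hit rate stays near zero until the cache can hold the working set, then jumps.

from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    cache, hits = OrderedDict(), 0
    for fp in trace:
        if fp in cache:
            hits += 1
            cache.move_to_end(fp)          # refresh recency on a hit
        else:
            cache[fp] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(trace)

trace = [i % 500 for i in range(10_000)]   # hypothetical trace with a 500-item working set
for size in (64, 256, 512, 1024):
    print(size, round(lru_hit_rate(trace, size), 2))
# prints 0.0 for 64 and 256 (thrashing), ~0.95 once capacity >= 500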

Q&A:

Q: Are the data/traces available?

A: Can't promise. But have your students come intern with us!

Q: Does speed/performance vary with chunk size?

A: Didn't look at it.

Q: Does newer data mean lower deduplication?




WAN Optimized Replication of Backup Datasets Using Stream-Informed Delta Compression

Philip Shilane, Mark Huang, Grant Wallace, and Windsor Hsu, EMC Corporation

My takeaway: when transferring data over a WAN, reduce the data aggressively. Do everything you can: dedup, compression, finding similarities with sketches, etc. And this paper is basically about how to cache sketches effectively.

Remote replication of backup data (the WAN is the bottleneck!)

How to optimize: Dedup, local compression, stream-informed caching!

Use sketches to find similar blocks, then transfer only fingerprints and deltas (differences)
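Roughly how such sketches work (my illustrative Python of the general super-feature technique; the constants, window size, and feature count are assumptions, not EMC's parameters): hash every sliding window of a block, keep the maximum under a few fixed linear transforms ("features"), and hash the features together into a super-feature. Two blocks sharing a super-feature are likely near-duplicates, so one can be delta-encoded against the other.

import hashlib

def features(block, k=4, window=32):
    # Fixed (a, b) pairs, one per feature; arbitrary constants for illustration.
    coeffs = [(1103515245, 12345), (69069, 1), (214013, 2531011), (22695477, 1)]
    feats = [0] * k
    for start in range(len(block) - window + 1):
        h = int.from_bytes(hashlib.md5(block[start:start + window]).digest()[:4], 'big')
        for i, (a, b) in enumerate(coeffs[:k]):
            feats[i] = max(feats[i], (a * h + b) & 0xFFFFFFFF)  # max-hash per transform
    return feats

def super_feature(block):
    # Collapse the k features into one super-feature; equal super-features
    # flag probable similarity.
    return hashlib.sha1(b''.join(f.to_bytes(4, 'big') for f in features(block))).digest()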

Sketch index options:

Full index (0.5 TB per super-feature for 256 TB of data), with random I/O on those 0.5 TB! Can find all similar matches.

Partial index (maybe an LRU policy? Not persistent): has to hold a full backup to be effective (so it ends up as big as the full index)

Stream-informed cache

Key insight: delta locality corresponds to dedup locality

Build sketches on the fly for the corresponding dedup chunks (effectiveness depends on stream locality and cache policy)
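A rough sketch of that idea in Python (the structure and names, e.g. on_dedup_hit, are my assumptions, not the paper's code): sketches live next to chunk metadata in the same container, and a fingerprint hit pulls that container's sketches into a small in-memory cache, betting that dedup locality predicts where the next delta matches will come from.

from collections import OrderedDict

class StreamInformedCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.sketches = OrderedDict()            # sketch -> base chunk id

    def on_dedup_hit(self, container):
        # container: (sketch, chunk_id) pairs stored together on disk
        for sketch, chunk_id in container:
            self.sketches[sketch] = chunk_id
            self.sketches.move_to_end(sketch)    # most recent stream wins
        while len(self.sketches) > self.capacity:
            self.sketches.popitem(last=False)    # evict the oldest sketches

    def find_similar(self, sketch):
        # Returns a base chunk to delta-encode against, or None.
        return self.sketches.get(sketch)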

Evaluation: the stream-informed cache finds about 15% fewer matches than a full index.

Effective bandwidth is 100x the actual bandwidth, 2x better than without sketches.

Overhead: 20% slowdown on writes, but only for non-duplicates




Power Consumption in Enterprise-Scale Backup Storage Systems
Zhichao Li, Stony Brook University; Kevin M. Greenan and Andrew W. Leung, EMC Corporation; Erez Zadok, Stony Brook University

My takeaway: the controller is more power-hungry than the disks. But this paper gives no further breakdown of power consumption.

Motivation: power-aware design is important, but there are no published measurements!

Measurements: idle power consumption 200-800 W (0.8-3 W per TB)

Deduplication saves power: it saves space (thus hardware) and saves I/O (thus energy)

Spin-down vs. power-down: spin-down saves 6.5 W per disk, while power-down saves 7.6-9.3 W per disk. (Still, 56% of total power consumption comes from components other than the disks!)
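Back-of-envelope with the talk's per-disk numbers (the 15-disk shelf is my assumption):

disks_per_shelf = 15                       # hypothetical shelf size
spin_down_w = 6.5                          # savings per idle disk (from the talk)
power_down_w = (7.6, 9.3)                  # savings range per idle disk (from the talk)
print('spin-down:', disks_per_shelf * spin_down_w, 'W per shelf')       # 97.5 W
print('power-down:', disks_per_shelf * power_down_w[0], 'to',
      disks_per_shelf * power_down_w[1], 'W per shelf')                 # 114.0 to 139.5 W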

Conclusion: the controller is more power-hungry than the disks!

System not power proportional

Disparate consumption between similar H/W

Q&A:

Q: With more disks, would you save power?

A: depends on customer requirements
