Friday, February 17, 2012

FAST'12 Session 7: a bit of everything

Extracting Flexible, Replayable Models from Large Block Traces
V. Tarasov and S. Kumar, Stony Brook University; J. Ma, Harvey Mudd College; D. Hildebrand and A. Povzner, IBM Almaden Research; G. Kuenning, Harvey Mudd College; E. Zadok, Stony Brook University

Main idea: take a standard block trace, divide it into chunks, and define feature functions that turn each chunk into a multi-dimensional histogram, trading accuracy for size reduction.
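A minimal sketch of the chunk-and-summarize idea (my own illustration, not the authors' tool; the trace record format and bin counts are assumptions): each chunk is reduced to a multi-dimensional histogram over request features, and a replayer can later sample synthetic requests from that histogram instead of storing the full trace.

```python
import random
from collections import Counter

# Hypothetical trace record: (offset, size, is_write) tuples.
def chunk_features(chunk, offset_bins=4, size_bins=4):
    """Summarize one chunk of a block trace as a multi-dimensional
    histogram over (offset bin, size bin, read/write)."""
    hist = Counter()
    max_off = max(r[0] for r in chunk)
    max_size = max(r[1] for r in chunk)
    for off, size, is_write in chunk:
        key = (min(offset_bins - 1, off * offset_bins // (max_off + 1)),
               min(size_bins - 1, size * size_bins // (max_size + 1)),
               is_write)
        hist[key] += 1
    return hist

def replay_sample(hist, n):
    """Draw n synthetic (offset bin, size bin, is_write) requests from the
    histogram -- trading per-request accuracy for a model that is far
    smaller than the raw trace."""
    keys, weights = zip(*hist.items())
    return random.choices(keys, weights=weights, k=n)

# Toy usage with a fabricated chunk of five requests.
chunk = [(0, 4096, False), (8, 4096, False), (16, 65536, True),
         (1024, 4096, True), (1032, 4096, True)]
hist = chunk_features(chunk)
print(hist)
print(replay_sample(hist, 3))
```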
Q&A:
Q: Latency for dependent writes?
A: We don't defer for dependencies; that's too hard. But in general, people don't do that with pure replay either.






scc: Cluster Storage Provisioning Informed by Application Characteristics and SLAs
Harsha V. Madhyastha, University of California, Riverside; John C. McCullough, George Porter, Rishi Kapoor, Stefan Savage, Alex C. Snoeren, and Amin Vahdat, University of California, San Diego

I really like this one!!!!

How to do hardware provisioning to achieve performance goals while reducing cost.
Goal: understand the configuration space (now and in the future)
scc: building blocks + app workload + SLA = SLA-cost curve, and for each cost, an instantiation (see the sketch below)
Server model: CPU + RAM + disk + network
App model: tasks + datasets + edges between tasks and datasets (I/O) + network?
SLA: operations per second
Compute: from input to output, with details like providing cache, enough CPU and storage, etc.
Evaluation: 4x cost reduction for the same SLA!
Future: a cloud deployment model?
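A toy sketch of the kind of search scc performs (the component catalog, costs, and performance model below are entirely made up): enumerate candidate configurations, keep those that meet the SLA, and report the cheapest instantiation per SLA level, which traces out an SLA-cost curve.

```python
from itertools import product

# Hypothetical component catalog: capacities and costs are invented numbers.
CPU = {"ops_per_sec": 50_000, "cost": 200}    # per core
DISK = {"ops_per_sec": 5_000, "cost": 100}    # per spindle
RAM_GB = {"cost": 10}

def meets_sla(cores, disks, ram_gb, sla_ops, working_set_gb):
    # Simplistic model: throughput limited by the weaker of CPU and disk;
    # enough RAM to hold the working set lets the cache absorb most I/O
    # (assumed 10x effect).
    disk_ops = DISK["ops_per_sec"] * disks
    if ram_gb >= working_set_gb:
        disk_ops *= 10
    return min(CPU["ops_per_sec"] * cores, disk_ops) >= sla_ops

def cheapest_config(sla_ops, working_set_gb=64):
    best = None
    for cores, disks, ram in product(range(1, 33), range(1, 25), (16, 32, 64, 128)):
        if not meets_sla(cores, disks, ram, sla_ops, working_set_gb):
            continue
        cost = cores * CPU["cost"] + disks * DISK["cost"] + ram * RAM_GB["cost"]
        if best is None or cost < best[0]:
            best = (cost, cores, disks, ram)
    return best

# SLA-cost curve: cheapest instantiation for each target ops/sec.
for sla in (50_000, 200_000, 800_000):
    print(sla, cheapest_config(sla))
```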
Q&A:
Q: Energy cost? What about a future, more demanding SLA on top of your current configuration?
A: We do include some power cost. We do look at whether a configuration scales easily.
Q: Something about how you simulate computation cost?
A: (Didn't understand the answer.)
Q: How do you optimize software and hardware tuning together to get maximum performance?
A: Future work?
Q: Multiple apps on the same cluster?
A: Future work; since apps interfere with each other, it will be hard.





iDedup: Latency-aware, Inline Data Deduplication for Primary Storage

Kiran Srinivasan, Tim Bisson, Garth Goodson, and Kaladhar Voruganti, NetApp, Inc.

How to do dedup in a primary storage system (traditionally an offline feature).
Why inline dedup: no over-provisioning for bursts
no background processing
efficient use of resources
Key design: 1. Only dedup sequences of consecutive blocks, to reduce seeks and thus lower overhead (sequence length configurable)
2. In-memory fingerprint database (FPDB): possible because for primary storage it is smaller (FPDB cache size also configurable)
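A rough sketch of the sequence-threshold idea (my reconstruction, not NetApp's code; the fingerprinting scheme and threshold value are assumptions): a run of incoming blocks is deduplicated only if every block already exists on disk and the matches are physically consecutive, so later reads of the run stay sequential.

```python
import hashlib

MIN_SEQ = 4  # configurable minimum duplicate-sequence length (assumption)

def fingerprint(block):
    return hashlib.sha1(block).hexdigest()

def inline_dedup(write_blocks, fpdb):
    """fpdb: in-memory dict mapping fingerprint -> physical block number.
    Returns a list of ('dedup', pbn) or ('write', block) decisions."""
    decisions = []
    run = []  # candidate duplicate run: (block, pbn) pairs

    def flush_run(as_dedup):
        for blk, pbn in run:
            decisions.append(("dedup", pbn) if as_dedup else ("write", blk))
        run.clear()

    for blk in write_blocks:
        pbn = fpdb.get(fingerprint(blk))
        # Extend the run only if the match is physically consecutive
        # with the previous match.
        if pbn is not None and (not run or pbn == run[-1][1] + 1):
            run.append((blk, pbn))
        else:
            flush_run(len(run) >= MIN_SEQ)
            if pbn is not None:
                run.append((blk, pbn))
            else:
                decisions.append(("write", blk))
    flush_run(len(run) >= MIN_SEQ)
    return decisions
```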
Q&A
Q: You lose a lot of dedup opportunities compared to backup, where from one backup to the next there is a lot of duplication.
Q: In reality, outstanding I/O varies and is bursty.
A: That is a concern. We haven't addressed it.

Thursday, February 16, 2012

FAST'12 Session 7: Cloud

BlueSky: A Cloud-Backed File System for the Enterprise
Michael Vrable, Stefan Savage, and Geoffrey M. Voelker, University of California, San Diego

Cloud Interface: only supports writing whole objects, but does support random read access
System design: NFS/CIFS interface (which itself has overhead!)
write-back caching; data is asynchronously pushed to the cloud
log-structured file system layout
cleaner running in the cloud (Amazon EC2)
Performance: good if everything hits the cache; otherwise not so good. Whole-segment prefetching improves performance.
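A minimal sketch of the write-back, log-structured idea (my own illustration; the segment size and the dict standing in for an S3-style whole-object store are assumptions): writes are absorbed locally, and only full log segments are pushed to the cloud as whole objects.

```python
SEGMENT_SIZE = 4 * 1024 * 1024  # 4 MB log segments (illustrative value)

class CloudLog:
    """Toy write-back cache that batches file writes into log segments and
    pushes full segments to a whole-object cloud store.  A real proxy would
    push segments asynchronously and keep a local disk cache as well."""
    def __init__(self, cloud_put):
        self.cloud_put = cloud_put          # stand-in for an S3-style PUT
        self.segment = bytearray()
        self.seg_no = 0

    def write(self, data: bytes):
        self.segment += data                # absorb writes locally first
        if len(self.segment) >= SEGMENT_SIZE:
            self.cloud_put(f"segment-{self.seg_no}", bytes(self.segment))
            self.segment = bytearray()
            self.seg_no += 1

# Usage: the "cloud" here is just a dict acting as a whole-object store.
store = {}
log = CloudLog(cloud_put=store.__setitem__)
for _ in range(1100):
    log.write(b"x" * 4096)
print(list(store), len(log.segment), "bytes still buffered locally")
```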
Q&A:
Q: Multiple proxies accessing the same data store?
A: Thought about it, not implemented. You need distributed locking if you want strong consistency.
Q: Backup?
A: Because we are log-structured, you just read an old checkpoint region and don't do cleaning!
Q: Was cleaning worth it? Why not just send more requests at a time to increase bandwidth?
A: The big reason is cost.
Q: We don't see those bad local numbers at Red Hat!
A: Nothing special in our setup. If you use NetApp, of course it's better.
Q: Consistency from the cloud?
A: Not a big problem for us because we are log-structured and don't overwrite data!







Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads

Osama Khan and Randal Burns, Johns Hopkins University; James Plank and William Pierce, University of Tennessee; Cheng Huang, Microsoft Research


My takeaway: big-data-aware erasure codes are needed.

Replication is too expensive for big data. Erasure coding (fancier parity) comes to the rescue.
Two prominent operations: disk reconstruction
degraded reads
Erasure codes designed for recovery I/O: Rotated Reed-Solomon codes.
Didn't really understand... but basically it is a new algorithm that accounts for (1) most failures being single-disk failures, and (2) most failures being transient.
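To make the recovery-I/O point concrete, here is a toy single-parity (XOR) stripe, deliberately not the paper's Rotated Reed-Solomon construction: with flat parity, rebuilding or serving a degraded read of one lost block requires reading all k surviving blocks, and cutting that read fan-in is exactly what recovery-aware codes aim at.

```python
from functools import reduce

k = 6                                            # data blocks per stripe
data = [bytes([i + 1]) * 8 for i in range(k)]    # tiny 8-byte "blocks"

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(data)                        # one parity block

# Degraded read of lost block 2: with flat parity we must read the other
# k-1 data blocks plus the parity block, i.e. k reads for one logical read.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
recovered = xor_blocks(survivors)
assert recovered == data[lost]
print("block reads needed for one degraded read:", len(survivors))
```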








NCCloud: Applying Network Coding for the Storage Repair in a Cloud-of-Clouds

Yuchong Hu, Henry C.H. Chen, and Patrick P.C. Lee, The Chinese University of Hong Kong; Yang Tang, Columbia University

My takeaway:
if you store redundant data in a "cloud array", you want to do it in a traffic-aware way. (But why do you want to do it anyway?!)

Motivation: multiple-cloud storage
They developed a proxy that distributes data to multiple clouds transparently; the proxy can be mounted as a file system.
Use an MDS code for redundancy, and repair data when one cloud is unavailable.
Goal: minimize repair traffic.
How: fetch one chunk instead of the whole file when repairing (using regenerating codes)
System design:
code chunk = linear combination of original data chunks.
repair: one chunk from each surviving node
and some details I don't understand, but mostly coding details and how to minimize traffic
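A toy XOR illustration of "code chunk = linear combination of data chunks" (this is plain GF(2) parity, not NCCloud's actual FMSR/regenerating code): a failed cloud is repaired by pulling one chunk from each surviving cloud and recombining, without reassembling the whole file.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d1, d2 = b"\x0f" * 8, b"\xf0" * 8              # original data chunks
clouds = {"A": d1, "B": d2, "C": xor(d1, d2)}  # toy (2+1) MDS layout

# Cloud A becomes unavailable: pull ONE chunk from each surviving cloud
# and recombine, rather than downloading and re-encoding the whole file.
rebuilt = xor(clouds["B"], clouds["C"])
assert rebuilt == d1
```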

Q&A:
Q: What are the odds of losing two clouds?
A: Not all clouds are as reliable as Amazon S3.
Q: Cost of the additional code storage? Not really feasible?
A: Thank you for your comments...

FAST'12 Session 6: Mobile and Social

ZZFS: A Hybrid Device and Cloud File System for Spontaneous Users
Michelle L. Mazurek, Carnegie Mellon University; Eno Thereska, Dinan Gunawardena, Richard Harper, and James Scott, Microsoft Research, Cambridge, UK

My takeaway: interesting framework....

This is mainly a Microsoft study (it came out of an internship).
Personal data requires collaboration and coordinating devices.
Key problem: unavailable devices (an offline laptop?)
Eventual consistency (like Coda) means no data now. The public cloud is not that trustworthy.
ZZFS key idea: turn on devices that are off!
Human factors are important!
Hardware: an always-on NIC, transparent to applications
System design:
metadata service: flat, device-transparent namespace; a single instance, cached on all devices
I/O director: decides where to read/write data (different policies can apply)
writes through to all copies; if a device is not available, it may write to a log
Evaluation: read latency to retrieve a song when the device is shut down: 23s
write latency when a device is not available and the write is offloaded: 0.2-1s
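A rough sketch of the I/O-director behaviour as I understood it (my own code; the names and structure are assumptions): write through to every device holding a copy, and append to a per-device log when a device is unreachable, to be replayed once the always-on NIC has woken it up.

```python
class IODirector:
    """Toy I/O director: writes go through to every device holding a copy;
    unreachable devices get the update appended to an offload log instead
    (to be replayed later, once the device has been woken up)."""
    def __init__(self, devices):
        self.devices = devices   # name -> write callable, or None if offline
        self.offload_log = {name: [] for name in devices}

    def write(self, replica_names, data):
        for name in replica_names:
            dev = self.devices.get(name)
            if dev is not None:
                dev(data)        # write through to the live copy
            else:
                self.offload_log[name].append(data)   # the 0.2-1s offload path

# Usage: the laptop is reachable, the phone is off.
laptop_blocks = []
director = IODirector({"laptop": laptop_blocks.append, "phone": None})
director.write(["laptop", "phone"], b"song.mp3 bytes")
print(len(laptop_blocks), director.offload_log["phone"])
```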
Q&A:
Q: Some noise in the measurements?
A: Maybe from a bad wireless router?
Q: Is the public cloud really not trustworthy?
A: We are just offering alternatives.
Q: More details on the cache?
A: Synchronous and serialized writes. And we build upon something that has strong consistency.


Revisiting Storage for Smartphones
Hyojun Kim, Nitin Agrawal, and Cristian Ungureanu, NEC Laboratories America

My takeaway: for mobile apps, I/O (especially random I/O) matters! It's not just the network you should worry about. And don't buy a Kingston SD card :p

Best paper!
Loading time is bad for mobile users! But aren't the network and CPU the real problem?
Background: the network is considered important, CPU/graphics are also considered important, but the impact of storage performance is not well understood.
Why is storage a problem?
1. Random I/O is worse than sequential on flash storage (even worse than wireless), but flash is classified based on sequential throughput.
2. Apps use random I/O.
Measurements:
1. Storage affects performance!
2. SD speed class is not really indicative.
Experimental results:
Performance varies across SD cards! (Kingston was 200x slower!) Even with faster WiFi, app runtime doesn't get faster.
About 2x more sequential writes than random writes, but a 4-1000x random/sequential performance gap on SD cards, so random access is the bottleneck.
Why so much random I/O: some data goes through the FS (cached), and some data is written to SQLite synchronously.
Thus placing the DB well will improve performance, or just disable SQLite's sync mode (see the sketch below)...
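The sync knob the talk alludes to is presumably SQLite's synchronous pragma; a small sketch of the tradeoff (the batching workaround at the end is my own suggestion, not the paper's):

```python
import sqlite3

con = sqlite3.connect(":memory:")   # in-memory DB just for illustration
con.execute("CREATE TABLE t(x)")

# FULL: every committed transaction is synced, which on a slow SD card
# turns into many small random writes.
con.execute("PRAGMA synchronous = FULL")

# OFF: the "just disable sync" option; far fewer forced flushes, but a
# crash or power loss can now lose or corrupt recent transactions.
con.execute("PRAGMA synchronous = OFF")

# Batching many inserts into one transaction is a safer way to cut the
# random-write count without giving up durability entirely.
with con:
    con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
```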
Solutions:
RAID-0 over SD cards?
Log-structured file system?
App-specific selective sync?
Q&A:
Q: This reminds me of the Wisconsin paper. Are we now dealing with stupid applications?!
A: Apps are written in a modular way and become performance-oblivious. There are tradeoffs, and we need to be a bit more flexible.
Q: Does performance matter? Maybe the UI matters more for the end-to-end experience.
A: We actually do wait for apps to load data.





Serving Large-scale Batch Computed Data with Project Voldemort

Roshan Sumbaly, Jay Kreps, Lei Gao, Alex Feinberg, Chinmay Soman, and Sam Shah, LinkedIn Corp


My takeaway: didn't quite understand this one... seems like storage tailored for a distributed key-value store?

Batch-computed data, MapReduce (Hadoop)
Full refresh of the output instead of incremental updates
Voldemort: a distributed key-value system, basically a Dynamo clone
Custom Voldemort storage engine:
transform Hadoop output to construct the store
a folder for every store; inside the store, multiple versions of the data are kept

FAST'12 Session 5: OS techniques

FIOS: A Fair, Efficient Flash I/O Scheduler
Stan Park and Kai Shen, University of Rochester

My takeaway: the I/O scheduler should be adapted to flash/SSDs; details on how are in the paper.

High performance is easy; fairness is the bigger concern.
Linux schedulers: lack flash awareness and anticipation support, or anticipate too aggressively; they don't suit SSDs.
Observation: reads are fast with little variation; writes are slow and variable.
Policy: timeslice management, accounting for read/write asymmetry, anticipation, and parallelism.
Prefer reads over writes; a linear cost model (see the sketch below).
Anticipation is not for performance, but for fairness.
Evaluation: I/O slowdown (latency?) is considerably lower, and fair.
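A toy sketch of timeslice accounting with a linear cost model (my own illustration; the cost constants and the tie-breaking rule are made up, see the paper for the real policy): reads are charged a small, nearly constant cost, writes a larger size-dependent one, and dispatch favours reads and then the task that has consumed the least service.

```python
# Made-up per-request cost model (microseconds): reads cheap and nearly
# constant, writes expensive and size-dependent.
READ_COST_US = lambda size: 60 + 0.01 * size
WRITE_COST_US = lambda size: 250 + 0.08 * size

consumed = {"taskA": 0.0, "taskB": 0.0}     # service consumed this epoch

def charge(task, is_write, size):
    cost = (WRITE_COST_US if is_write else READ_COST_US)(size)
    consumed[task] += cost

def pick_next(pending):
    """pending: list of (task, is_write, size).  Reads go first; among
    equals, pick the task that has consumed the least of its timeslice."""
    return min(pending, key=lambda req: (req[1], consumed[req[0]]))

charge("taskA", True, 65536)                # taskA already issued a big write
nxt = pick_next([("taskA", False, 4096), ("taskB", True, 4096)])
print(nxt)   # taskA's read is still preferred over taskB's write
```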

Q&A:
Q: What's your definition of fairness (equal latency or equal throughput)?
A: Latency.
Q: Then that's not really fair...
Q: This scheduler is not limited to files?
A: No, it's more general.
Q: Fairness with different priority classes?
A: Giving more timeslice to priority processes will work.









Shredder: GPU-Accelerated Incremental Storage and Computation
Pramod Bhatotia and Rodrigo Rodrigues, Max Planck Institute for Software Systems (MPI-SWS); Akshat Verma, IBM Research—India

My takeaway: dedup is offloaded to the GPU, and you need clever design for it.

Motivation: dedup to store big data, and incremental storage/computation; processing data inside the storage system thus becomes the bottleneck.
Use the GPU to accelerate it.
However, a straightforward GPU port still can't match I/O bandwidth (GPUs are designed for compute-intensive rather than data-intensive tasks), so a new design is needed for data-intensive tasks.
The basic design (transfer data to the GPU for chunking, then transfer it back) doesn't scale:
1. Host-device communication bottleneck.
Solution: asynchronous execution (start GPU computation before all the data has been transferred),
with a circular ring of pinned buffers to address the need for pinned host memory (sketched below).
2. Device memory conflicts (multiple GPU threads contend for the same memory bank)...
Solution: memory coalescing? Didn't understand...
Evaluation: 5x speedup compared to multi-core; matches I/O bandwidth.
Case study: incremental MapReduce with content-based chunking.
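A conceptual sketch of the asynchronous, ring-buffered pipeline (plain Python threads standing in for CUDA streams; nothing here is Shredder's real code): a small ring of reusable "pinned" buffers lets the host stage the next buffer while the "device" chunks the previous one, so transfer and computation overlap.

```python
import hashlib
import queue
import threading

RING_SLOTS = 4                      # small ring of reusable "pinned" buffers
free_slots, filled = queue.Queue(), queue.Queue()
for slot in range(RING_SLOTS):
    free_slots.put(slot)

def host_reader(data_blocks):
    """Host side: stage each block into a free ring slot ("async copy")."""
    for blk in data_blocks:
        slot = free_slots.get()     # reuse a fixed buffer, no reallocation
        filled.put((slot, blk))
    filled.put(None)                # sentinel: no more data

def device_chunker():
    """'Device' side: chunk/fingerprint each staged buffer, return the slot."""
    while (item := filled.get()) is not None:
        slot, blk = item
        hashlib.sha1(blk).digest()  # stand-in for the chunking kernel
        free_slots.put(slot)        # hand the buffer back to the host

blocks = [bytes([i % 256]) * 4096 for i in range(64)]
worker = threading.Thread(target=device_chunker)
worker.start()
host_reader(blocks)
worker.join()
print("chunked", len(blocks), "blocks using", RING_SLOTS, "reusable buffers")
```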
Q&A:
Q: Even the HPC community deals with data-intensive tasks. How do you differentiate your work?
A: O(n^2) vs. O(n).









Adding Advanced Storage Controller Functionality via Low-Overhead Virtualization
Muli Ben-Yehuda, Michael Factor, Eran Rom, and Avishay Traeger, IBM Research—Haifa;

My takeaway: a VM is a sweet spot for adding storage controller functionality?

Motivation: how to add new functionality to a storage controller?
Options: deep integration (in the OS) - hard
external gateway - low performance
VM gateway (their approach); VM behaviour needs some adjusting though.
Q&A:
Q: Are cores assigned statically? What about VMs belonging to different companies?
A:
Q: Difference from a virtual storage appliance?
A: ...


FAST'12 Session 4: Flash and SSDs Part I


Reducing SSD Read Latency via NAND Flash Program and Erase Suspension
Guanying Wu and Xubin He, Virginia Commonwealth University

Missed that talk….:(






Optimizing NAND Flash-Based SSDs via Retention Relaxation
Ren-Shuo Liu and Chia-Lin Yang, National Taiwan University; Wei Wu, Intel Corporation

My takeaway: write the flash faster and you get more performance but a higher error rate; the errors can be corrected, but retention time is reduced.
Reliability decreases as density increases.
Retention relaxation: apps typically don't require long retention.
Modeling the Vth distribution: old model: flat region + fixed-sigma Gaussian;
new model: sigma grows with time.
Then they show the tradeoff between retention time and bit error rate using a curve.
Realistic workloads: typically only short retention is needed.
System design: classify host writes (high performance, low retention time) and background writes (lower performance, long retention time, as they contain cold data).
Mode selector + retention tracker (reprograms a block when it is about to run out of retention); sketched below.
Evaluation: 2x speedup, 5x for Hadoop.
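A hypothetical sketch of the mode-selector plus retention-tracker split (the function names and the weekly refresh threshold are my assumptions, not the paper's): foreground writes take the fast, short-retention mode, background writes the normal mode, and the tracker rewrites short-retention blocks before their window expires.

```python
import time

SHORT_RETENTION_S = 7 * 24 * 3600          # e.g. refresh weekly (assumption)

written = {}                               # block -> (mode, write_time)

def select_mode(is_host_write):
    return "fast-short-retention" if is_host_write else "normal-retention"

def write_block(block_no, is_host_write, now=None):
    t = time.time() if now is None else now
    written[block_no] = (select_mode(is_host_write), t)

def retention_tracker(now=None):
    """Rewrite blocks whose short-retention window is about to expire."""
    t = time.time() if now is None else now
    for blk, (mode, wt) in list(written.items()):
        if mode == "fast-short-retention" and t - wt > 0.9 * SHORT_RETENTION_S:
            written[blk] = ("normal-retention", t)   # reprogram in normal mode

write_block(1, is_host_write=True, now=0)    # foreground write: fast mode
write_block(2, is_host_write=False, now=0)   # background write: normal mode
retention_tracker(now=0.95 * SHORT_RETENTION_S)
print(written)
```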
Q&A:
Q: I'm concerned about the retention-time conclusion you drew: 30% of the data is never touched, and it should last!
A: That's why we have background migration.
Q: Can you guarantee that data needing long retention was saved with normal retention?
A: Every week data is converted to long-retention mode.
Q: What's the overhead of cleaning/converting?
A: We measured it using realistic traces.






SFS: Random Write Considered Harmful in Solid State Drives
Changwoo Min, Sungkyunkwan University and Samsung Electronics; Kangnyeon Kim, Sungkyunkwan University; Hyunjin Cho, Sungkyunkwan University and Samsung Electronics; Sang-Won Lee and Young Ik Eom, Sungkyunkwan University

My takeaway: it seems they implemented a log-structured file system with on-the-fly hot/cold data differentiation? Anyway, it works well on SSDs. I was too sleepy and could have missed some important points...

A new file system for SSDs.
Background: sequential writes are better than random writes.
Optimization options: SSD H/W – high cost
FTL with more efficient address-mapping schemes – no info about the FS, not effective for a no-overwrite FS
making applications SSD-aware – lacks generality
Hence the file-system approach.
When writing in 64MB units, random writes reach sequential performance.
So: a log-structured file system whose segment size is a multiple of the erase-block size.
Eager grouping of data at write time – differentiate hot and cold data; they categorize hot/cold data at runtime.
Colocate blocks of similar hotness into the same segment when they are first written.
Hotness = write count / age
Segment hotness = mean write count of live blocks / number of live blocks
Some details about how to divide segments into different "hotness" groups (see the sketch below).
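A direct reading of the two hotness formulas noted above (the data structures are simplified and the paper's exact definitions may differ slightly):

```python
def block_hotness(write_count, age):
    # hotness = write count / age: recently and frequently written = hot
    return write_count / age if age else float("inf")

def segment_hotness(live_block_write_counts):
    """Segment hotness = mean write count of live blocks / number of live blocks."""
    n = len(live_block_write_counts)
    if n == 0:
        return 0.0
    return (sum(live_block_write_counts) / n) / n

# A block written 12 times in 3 "ticks" is hotter than one written twice in 100.
print(block_hotness(12, 3), block_hotness(2, 100))
# A small, heavily rewritten segment scores hotter than a larger, cold one.
print(segment_hotness([12, 9, 11]), segment_hotness([1, 2, 1, 1, 2, 1]))
```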
Evaluation: outperforms LFS, as the segment utilization distribution is better (more nearly-empty and nearly-full segments).
Reduces the block erase count inside the SSD (thus prolonging the SSD's lifetime).
Q&A:
Q: Ever thought about compression?
A: The key insight is transforming random writes into large sequential writes.
Q: Why not compare your scheme with DAC?
A: We did compare in the paper.
Q: Why a 64MB chunk size?
A: Based on the SSD's properties.
Q: In practice SSDs are smart and optimize for random writes. They may break up your segments internally?
A: Good question! We measured and found that...
Q: Availability of SFS? Is it going to be a product?
A: No plan yet. It is open source now.
Q: You cache until you have 32MB of data to write? But some apps use synchronous writes.
A: Synchronization is hard for SFS. We group as many blocks as possible and then write all the remaining blocks.

Wednesday, February 15, 2012

FAST'12 Session 3: File System Design and Correctness

Recon: Verifying File System Consistency at Runtime
Daniel Fryer, Kuei Sun, Rahat Mahmood, TingHao Cheng, Shaun Benjamin, Ashvin Goel, and Angela Demke Brown, University of Toronto

My takeaway: checking consistency at runtime is interesting, and transforming global file-system properties into local invariants is also interesting.

Bugs corrupt in-memory file-system metadata.

Current solutions assume file systems are correct. Offline consistency checking (fsck) is slow, requires the FS to be offline, and repair is error-prone.

Their solution: Recon, runtime consistency checking.

Key idea: every update should result in a consistent FS image. (The disk can still corrupt data? Checksums handle that below the FS?)

Transform global consistency properties into fast, local consistency invariants.

When to check? (You don't want to check during every operation.) Right before the journal commit block is written.

System design: sits at the block-device layer; buffers metadata writes (write cache), interprets the metadata, compares it to the old data (read cache), and checks the invariants.

How to interpret: use the FS tree structure and just follow the pointers. (You need to understand the inode/block structure.)
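A toy version of what one local invariant check at commit time might look like (not Recon's actual rule set; the invariant chosen here is my own simplification): a bit newly set in the block bitmap must be referenced by a pointer written in the same transaction, otherwise the commit block is held back.

```python
def check_commit(old_bitmap, new_bitmap, new_pointers):
    """old/new_bitmap: sets of allocated block numbers before/after the
    transaction; new_pointers: block numbers referenced by metadata blocks
    buffered in this transaction."""
    newly_allocated = new_bitmap - old_bitmap
    orphans = newly_allocated - set(new_pointers)
    if orphans:
        raise RuntimeError(f"invariant violated, refusing commit: {orphans}")
    return True   # safe to let the journal commit block reach disk

check_commit({1, 2}, {1, 2, 7}, new_pointers=[7])          # passes
try:
    check_commit({1, 2}, {1, 2, 7, 9}, new_pointers=[7])   # block 9 has no pointer
except RuntimeError as err:
    print(err)
```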

Evaluation:

Inject corruption. Recon can detect corruptions not detected by fsck.

8% performance penalty (mainly due to cache misses)

Q&A

Q: What do you do after detecting corruption?

A: Fail-stop. (Maybe retry if the failure is transient?)

Q: What happens with delayed commits?

A: ext3 does that (large transactions).

Q: Could future file systems make checking easier?

A: Back pointers (less data to keep track of). Maybe express consistency in a declarative language.

Q: You are delaying writes. What about persistence requirements?

A: We only hold the commit block. It increases synchronous write latency. If you write synchronously, you write a commit block every time, so you pay the cost all the time.

Q: Could you apply this to other things? Distributed systems?

A: Consistency in distributed systems is more complex. Maybe a DB, which has FS-like structures to maintain, or some other transactional system?

Q: Why not inside the FS?

A: We don't depend on FS state correctness! We do rely on the FS structure, but it changes slowly!










Understanding Performance Implications of Nested File Systems in a Virtualized Environment
Duy Le, The College of William and Mary; Hai Huang, IBM T.J. Watson Research Center; Haining Wang, The College of William and Mary

My takeaway: the guest/host file system combination matters!

How to choose the guest/host file system combination.

Macro level: measure throughput and latency; the combination choice matters!

Writes are more critical than reads.

Latency is more sensitive than throughput.

Micro level: random/sequential reads/writes.

Reads are unaffected by nested file systems, while writes are affected.

Readahead happens at the hypervisor when nesting file systems.

Long idle times for queueing.

I/O scheduling is not effective on nested file systems.

The effectiveness of the guest FS's block allocation is NOT guaranteed.

Advice:

Read-dominated workloads: it doesn't matter; sequential reads may even improve with nested file systems (readahead).

Write-dominated workloads: avoid nesting, which causes extra metadata operations.

Latency-sensitive workloads: latency is increased.

Data allocation: better to pass it through.

Q&A

Q: Did you put the ext2 partitions at different locations on the disk?

A: No, we didn't.

Q: Then you didn't isolate the effect of disk zone properties!

A (again): we tried accessing different zones; the performance difference was within 5%.

Q: Are container files preallocated? Does the upper-level FS use direct I/O to bypass the host page cache?

A: Yes. Yes.

Q: We don't typically use an I/O scheduler in the guest.

A: We used the default I/O scheduler in both the guest and the host.

Q: Would your findings generalize to another layer of management?

A: We didn't think about it.

Q: Something about cache flushes. Didn't understand...








Consistency Without Ordering

Work from our group.

Q: What's the memory overhead?

A: Only the extra bitmap storage, so only one bitmap block per 4096 blocks.

Q: How large is the file system? For a large FS the scan time increases. What about an almost-full FS? Finding a free block is costly! Full file systems do exist!

A: In the common case the FS is not that full. For a full file system, this is not the best approach.

Q: Where do you store the back pointers?

A: In the out-of-band (OOB) area of future disks.

Q: Strong consistency guarantees? Not as strong as some file systems?

A: We provide data consistency: when you access data, the data belongs to this file, but it could be stale. ext3 provides stronger consistency. The on-disk image may not be an image of a file-system state that ever existed.

Q: Other problems this could solve?

A: Any system where you have a hierarchy (parents and children).

Q: CPU overhead?

A: We looked at it; not too different from ext2. We don't know how many extra cycles exactly, though.

Q: Are backpointers removed when a file is deleted?

A: Currently lazy deletion, so in the short term, no! We rely on mutual pointer agreement.
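A minimal sketch of the mutual-pointer-agreement check discussed above (my own illustration, not the actual implementation): a data block carries a backpointer naming the inode and offset that should own it, and a read is trusted only if the forward pointer and the backpointer agree.

```python
class Block:
    def __init__(self, data, back_inode, back_offset):
        self.data = data
        self.back = (back_inode, back_offset)     # stored out-of-band on disk

def read_checked(inode_no, offset, forward_ptrs, blocks):
    blk_no = forward_ptrs[(inode_no, offset)]
    blk = blocks[blk_no]
    if blk.back != (inode_no, offset):
        raise IOError("block does not belong to this file (stale/misdirected)")
    return blk.data

blocks = {100: Block(b"hello", back_inode=7, back_offset=0)}
fwd = {(7, 0): 100}
print(read_checked(7, 0, fwd, blocks))       # pointers agree -> data returned
fwd_bad = {(8, 0): 100}                      # inode 8 wrongly points at block 100
try:
    read_checked(8, 0, fwd_bad, blocks)
except IOError as err:
    print(err)
```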