Thursday, October 25, 2012

Network session at OSDI'12 (likely involving how systems use the network / SDN stuff?)


Tuesday, October 9, 2012

9:00 a.m.–10:30 a.m., Tuesday

Distributed Systems and Networking

Ray Dolby Ballroom 123
Session Chair: Jason Flinn, University of Michigan

Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE

Zhenyu Guo, Microsoft Research Asia; Xuepeng Fan, Microsoft Research Asia and Huazhong University of Science and Technology; Rishan Chen, Microsoft Research Asia and Peking University; Jiaxing Zhang, Hucheng Zhou, and Sean McDirmid, Microsoft Research Asia; Chang Liu, Microsoft Research Asia and Shanghai Jiao Tong University; Wei Lin and Jingren Zhou, Microsoft Bing; Lidong Zhou, Microsoft Research Asia

MegaPipe: A New Programming Interface for Scalable Network I/O

Sangjin Han and Scott Marshall, University of California, Berkeley; Byung-Gon Chun, Yahoo! Research; Sylvia Ratnasamy, University of California, Berkeley

DJoin: Differentially Private Join Queries over Distributed Databases

Arjun Narayan and Andreas Haeberlen, University of Pennsylvania

Wednesday, October 24, 2012

Storage-network interaction


Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems
CMU, 2007, tech report

The network incast problem is caused by the storage system reading different blocks in parallel from many servers at once.



A Cost-Effective, High-Bandwidth Storage Architecture.
ASPLOS-VIII, 1998


High Performance NFS: Facts & Fictions
SC'06



The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage
SC'04

Remote Direct Memory Access over the Converged Enhanced Ethernet Fabric: Evaluating the Options
HotI'09
D. Cohen, T. Talpey, A. Kanevsky, U. Cummings, M. Krause, R. Recio,
D. Crupnicoff, L. Dickman, and P. Grun


C. DeSanti and J. Jiang. FCoE in Perspective. In Proceedings of the 2008
International Conference on Advanced Infocomm Technology (ICAIT '08),
pages 1–8, Shenzhen, China, July 2008.


Network support for network-attached storage
Hot Interconnects, 1999

incast:
http://www.cs.cmu.edu/~dga/papers/incast-sigcomm2009.pdf

outcast:
https://engineering.purdue.edu/~ychu/publications/nsdi12_outcast.pdf

Sunday, October 21, 2012

Software defined storage

Windows Azure Storage: a highly available cloud storage service with strong consistency:
Microsoft Research
SOSP 2011

GFS-like stream layer. They added another layer on top of it to implement the Blob, Table, Queue (for reliable messaging), and Drive (NTFS volume) abstractions.
Focused on load balancing (by their fancy index techniques?) and the consistency protocol.
Scaling compute separately from storage: nice for multi-tenant, full-bisection-bandwidth environments, bad for latency/bandwidth to storage. They didn't talk about how Azure Storage stresses their network though, just general load balancing.



Argon: performance insulation for shared storage servers
Matthew Wachs, Michael Abd-El-Malek, Eno Thereska, Gregory R. Ganger
Carnegie Mellon University
FAST 2007 

Performance isolation for shared storage servers. They focus on the disk I/O bottleneck, assigning each workload a time quantum of exclusive disk access, combined with prefetching, etc. (A sketch of the time-quantum idea is below.)
For non-storage resources like CPU time and network bandwidth, they claim that well-established resource management mechanisms can already support time-sharing with minimal inefficiency from interference and context switching.
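
To make the quantum idea concrete, here is a minimal sketch in Python (my own simplification, not Argon's actual scheduler): each workload gets the disk exclusively for a fixed quantum, so its sequential runs are not chopped up by other workloads' requests.

from collections import deque
from itertools import groupby

def run_quanta(workloads, quantum_ms, total_ms):
    # workloads: dict name -> deque of per-request service times (ms)
    timeline, clock = [], 0.0
    rr = deque(workloads)                      # round-robin over workload names
    while clock < total_ms and any(workloads.values()):
        w = rr[0]; rr.rotate(-1)
        slice_end = clock + quantum_ms
        while workloads[w] and clock < slice_end:
            clock += workloads[w].popleft()    # serve one request with exclusive disk access
            timeline.append(w)
    return timeline

# Hypothetical example: a cheap sequential scanner vs. an expensive random workload.
seq = deque([1.0] * 60)      # sequential reads, ~1 ms each
rnd = deque([8.0] * 10)      # random reads, ~8 ms each
order = run_quanta({"scanner": seq, "random": rnd}, quantum_ms=20, total_ms=100)
print([(w, len(list(g))) for w, g in groupby(order)])
# -> [('scanner', 20), ('random', 3), ('scanner', 20), ('random', 3), ('scanner', 20)]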


The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage:

FIXME: read it!




Proportional-Share Scheduling for Distributed Storage Systems
UMich/HP Lab
FAST 2007

Assumes a dedicated storage system (clients vs. data nodes) and serves requests with a variation of fair queuing. In a sense similar to Fair Cloud, which schedules requests to a key-value store; however, there is no replication/migration/multiple resource types, etc., so the flexibility of this scheme is limited. They do it in a distributed way, whereas Fair Cloud uses a central scheduler.

They assume the network is good enough. (A sketch of basic start-time fair queuing is below.)
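
For reference, a minimal single-server start-time fair queuing (SFQ) sketch — the textbook building block, not the paper's distributed algorithm; the class and method names are my own.

import heapq

class SFQServer:
    def __init__(self, weights):
        self.weights = weights                    # flow -> weight (share)
        self.finish = {f: 0.0 for f in weights}   # last finish tag per flow
        self.vtime = 0.0                          # virtual time
        self.queue = []                           # (start_tag, seq, flow, cost)
        self.seq = 0

    def enqueue(self, flow, cost):
        start = max(self.vtime, self.finish[flow])
        self.finish[flow] = start + cost / self.weights[flow]
        heapq.heappush(self.queue, (start, self.seq, flow, cost))
        self.seq += 1

    def dispatch(self):
        start, _, flow, _cost = heapq.heappop(self.queue)
        self.vtime = start                        # advance virtual time
        return flow

# Hypothetical example: flow A has twice B's weight, so with equal-cost
# requests A is dispatched roughly twice as often while both are backlogged.
srv = SFQServer({"A": 2.0, "B": 1.0})
for _ in range(6):
    srv.enqueue("A", 1.0)
    srv.enqueue("B", 1.0)
print([srv.dispatch() for _ in range(9)])   # -> roughly a 2:1 A:B pattern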


Ursa Minor: versatile cluster-based storage
CMU
FAST 2005

Online change (software-defined, late binding, whatever you call it) and adaptive management of data encoding (replication or parity, etc.) and the fault model (how many fail-stop failures and how many Byzantine failures to tolerate). Not quite sure how they learn it though??


Others:
Storage Virtualization / Software Defined Storage:

SNIA Technical tutorial on Storage Virtualization from 2003 -
http://www.snia.org/sites/default/files/sniavirt.pdf
SANRAD white paper about snapshots and Global replication etc with
storage virtualization -
http://www.sanrad.com/uploads/Server_and_Storage_Virtualization_a_Complete_Solution.pdf




Cloud database / file systems:

Smoke and mirrors: reflecting files at a geographically remote
location without loss of performance -
http://dl.acm.org/citation.cfm?id=1525924

 
Cassandra - A Decentralized Structured Storage System -
http://dl.acm.org/citation.cfm?id=1773922


API:
Cassandra offers a key + structured object data model (the object consists of multiple, hierarchical column families), and provides atomic operations per key per replica; there is no transactional guarantee. Namely, insert row, get column of row, and delete column of row are the main API calls. (A toy sketch of the data model follows.)
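
To make the data model concrete, a toy in-memory sketch (illustrative only; the class and method names here are mine, not Cassandra's actual client API):

class ToyColumnStore:
    def __init__(self):
        self.rows = {}                                   # key -> {column_family: {column: value}}

    def insert(self, key, column_family, column, value):
        self.rows.setdefault(key, {}).setdefault(column_family, {})[column] = value

    def get(self, key, column_family, column):
        return self.rows.get(key, {}).get(column_family, {}).get(column)

    def delete(self, key, column_family, column):
        self.rows.get(key, {}).get(column_family, {}).pop(column, None)

store = ToyColumnStore()
store.insert("user:42", "profile", "name", "alice")      # atomic per key per replica
print(store.get("user:42", "profile", "name"))           # -> 'alice'
store.delete("user:42", "profile", "name")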
 
Distributed aspects:
Cassandra is a decentralized (Chord-like) distributed storage system, which uses Chord-style consistent hashing for key partitioning and Chord-style replication management (replicate to the next N-1 nodes on the ring, possibly rack-aware). All standard distributed techniques here. (A ring-placement sketch is below.) One thing interesting:
they use an accrual failure detector to detect node failures, and observe an exponential distribution of inter-arrival times for gossip messages.
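
A minimal sketch of this kind of ring placement (assumed details: MD5 positions, no virtual nodes, no rack awareness): the key's coordinator is the first node clockwise on the ring, and the replicas are the next N-1 distinct nodes after it.

import hashlib
from bisect import bisect_right

def ring_pos(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def replicas(key, nodes, n=3):
    ring = sorted((ring_pos(node), node) for node in nodes)
    positions = [p for p, _ in ring]
    i = bisect_right(positions, ring_pos(key)) % len(ring)   # first node clockwise
    return [ring[(i + j) % len(ring)][1] for j in range(n)]  # coordinator + next n-1

nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
print(replicas("user:42", nodes, n=3))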

Local persistence mechanism (all sequential):
1. Data is first persisted to a commit log (to ensure durability), and only after that are the changes applied to the in-memory data structure. This commit log (or journal) is written sequentially to a dedicated disk, with, of course, a lot of fsync.

2. They also write the in-memory structure to disk in a sequential-only fashion, i.e., every time memory fills up, a batch of new files is generated instead of modifying and augmenting existing files. More specifically: one file per column family, one file for the primary key, and also an index file for the primary key. (Sort of column-oriented storage, but not entirely.) After this, the corresponding commit log can be deleted.

3. They periodically combine these generated files into larger files. (A sketch of steps 1 and 2 is below.)
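
A minimal sketch of steps 1 and 2 of that write path (my own file layout and names, not Cassandra's actual formats; the compaction of step 3 is omitted):

import json, os

class ToyWritePath:
    def __init__(self, log_path, flush_threshold=4):
        self.log = open(log_path, "a")
        self.memtable = {}
        self.flush_threshold = flush_threshold
        self.generation = 0

    def write(self, key, value):
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())              # 1. durable, sequential commit log first
        self.memtable[key] = value               # ...then apply to the in-memory structure
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 2. sequential-only: write a brand-new file, never modify an existing one
        fname = f"data-{self.generation}.json"
        with open(fname, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.generation += 1
        self.memtable.clear()                    # the corresponding commit log can now go

wp = ToyWritePath("commit.log")
for i in range(8):
    wp.write(f"key{i}", f"value{i}")             # produces data-0.json and data-1.json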

 
The Case for RAMClouds: Scalable High-Performance Storage Entirely in
DRAM - http://dl.acm.org/citation.cfm?id=1713276



How a little improvement in data center infrastructure enables you to do interesting things



Spanner: Google’s Globally-Distributed Database
OSDI 2012's best paper

They implemented a semi-relational database, Bigtable-like, but with much stronger (external) consistency.

My takeaway:
Improving the data center timing system and having a global clock makes consistency much easier.

The TrueTime API is really nice! Google is right that absolute time is much easier to manage than relative time (e.g., Lamport clocks). (A sketch of the commit-wait rule is below.)
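
A minimal sketch of the TrueTime interval plus the commit-wait rule, as I understand it from the paper (the epsilon value and the clock source here are made up):

import time

EPSILON = 0.007   # assumed ~7 ms clock uncertainty

class TT:
    @staticmethod
    def now():
        t = time.time()
        return (t - EPSILON, t + EPSILON)   # interval guaranteed to contain real time

def commit(apply_writes, release_locks):
    s = TT.now()[1]            # commit timestamp: no earlier than any current real time
    while TT.now()[0] <= s:    # commit wait: spin until s is definitely in the past
        time.sleep(0.001)
    apply_writes(s)            # any transaction starting after this sees a timestamp > s
    release_locks()

commit(lambda s: print("applied at", s), lambda: print("locks released"))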


Flat Datacenter Storage:
Microsoft Research
OSDI 2012

This is basically a re-implementation of GFS on a data center network whose full bisection bandwidth matches the aggregate disk bandwidth. (Thus flat, i.e., locality-oblivious.)

Data placement is deterministic (using a hash function) and doesn't consider locality. (A placement sketch is below.)

They push metadata management into each data node, while keeping only node information in a central server. Replication/recovery is a bit Chord-style, I think?

Other than that, not too much difference from GFS. This is the only work (that I know of) that talks about how a storage system uses the network, though: short, bursty flows, and a bimodal distribution of packet sizes (small control messages, large data blocks). Not good for TCP, so they use DCTCP?
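
An illustrative sketch of deterministic, hash-based placement (I have not checked this against FDS's exact tract-locator formula; the table below is hypothetical): a blob's i-th tract maps to a row of a tract locator table purely by hashing, so clients never ask a metadata server on the data path.

import hashlib

def tract_servers(tlt, blob_guid, tract_index, replicas=3):
    row = (int(hashlib.sha1(blob_guid.encode()).hexdigest(), 16) + tract_index) % len(tlt)
    return tlt[row][:replicas]

# Hypothetical tract locator table: each row lists candidate tractservers.
TLT = [["ts%d" % ((r + k) % 8) for k in range(3)] for r in range(8)]
for i in range(4):
    print(i, tract_servers(TLT, "blob-0xDEADBEEF", i))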



Multiple Resource Scheduling



Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, Ion Stoica
University of California, Berkeley
NSDI, 2011

Max-min fairness on the dominant resource share.
Motivated by data center Hadoop/Dryad execution.
DRF satisfies four important properties:
sharing incentive, strategy-proofness, envy-freeness, Pareto efficiency.
(A toy sketch of the allocation algorithm follows this entry.)

My comment:
sharing between non-collaborating, competitive users.
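
A toy sketch of the allocation (simplified progressive filling: it stops at the first user whose next task no longer fits; the example numbers are what I recall as the paper's running example):

def drf_allocate(capacity, demands):
    """capacity: dict resource -> total amount.
       demands: dict user -> dict resource -> per-task demand."""
    used = {r: 0.0 for r in capacity}
    alloc = {u: 0 for u in demands}            # tasks granted per user
    dom_share = {u: 0.0 for u in demands}      # dominant share per user
    while True:
        u = min(dom_share, key=dom_share.get)  # user with lowest dominant share
        d = demands[u]
        if any(used[r] + d[r] > capacity[r] for r in d):
            return alloc                       # some resource is exhausted
        for r in d:
            used[r] += d[r]
        alloc[u] += 1
        dom_share[u] = max(alloc[u] * d[r] / capacity[r] for r in d)

# 9 CPUs and 18 GB memory; user A tasks need <1 CPU, 4 GB>, user B tasks <3 CPUs, 1 GB>.
print(drf_allocate({"cpu": 9, "mem": 18},
                   {"A": {"cpu": 1, "mem": 4},
                    "B": {"cpu": 3, "mem": 1}}))
# -> {'A': 3, 'B': 2}: both end up with equal dominant shares (2/3).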



Multi-resource fair queueing for packet processing

UC-Berkeley
SIGCOMM, 2012

They implement DRF in a time-sharing manner (instead of space sharing), mostly motivated by multiple-resource scheduling in middleboxes. (A much-simplified sketch of the idea is below.)
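
A much-simplified sketch of the idea (not the paper's actual DRFQ algorithm, which uses virtual start/finish tags and handles idle flows; the flow names and processing times below are made up): always serve the backlogged flow whose dominant cumulative processing time is smallest.

def schedule(flows, resources, n_packets):
    """flows: dict name -> list of per-packet processing-time dicts,
       e.g. {"cpu": 3, "nic": 1} microseconds."""
    served = {f: {r: 0.0 for r in resources} for f in flows}
    order = []
    for _ in range(n_packets):
        backlogged = [f for f in flows if flows[f]]
        if not backlogged:
            break
        f = min(backlogged, key=lambda f: max(served[f].values()))  # lowest dominant time
        pkt = flows[f].pop(0)
        for r, t in pkt.items():
            served[f][r] += t
        order.append(f)
    return order

# Hypothetical example: flow A is CPU-heavy, flow B is NIC-heavy.
print(schedule({"A": [{"cpu": 3, "nic": 1}] * 5,
                "B": [{"cpu": 1, "nic": 3}] * 5},
               ["cpu", "nic"], 8))   # -> alternates A and B (equal dominant shares)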



Performance Isolation and Fairness for multi-tenant cloud storage
David Shue, Michael J. Freedman, and Anees Shaikh.

OSDI, 2012
   
System-level max-min fairness at the key-value store level.
Assumes a well-provisioned network; does not deal with in-network resource sharing.
They mainly deal with: where to place datasets, where to replicate them, how to allocate weights to each node, and how to do fair queuing locally to ensure DRF even though some datasets are (dynamically) more popular than others. (They use DRF, so they do deal with multiple resources, namely bytes received, bytes sent, and requests; bytes are bound by network/storage, requests by interrupts???)

FAST'12 Session 1: Implications of New Storage Technology



De-indirection for Flash-based SSDs with Nameless Writes

from our group

Q&A:

Q: An SSD is not a hard disk drive. Why not expose SSD internals to file systems?
A: Let vendors control SSD internals.

Q: How about associating data in callbacks?
A: In the OOB (out-of-band) area.

Q: Why not a richer interface? Hints to the device, maybe?
A: That could be useful.

Q: More interesting with Btrfs?


The Bleak Future of NAND Flash Memory
Laura M. Grupp, University of California, San Diego; John D. Davis, Microsoft Research, Mountain View; Steven Swanson, University of California, San Diego

My takeaway: SSDs are not replacing HDDs; tradeoffs must be made to increase capacity and so on.

A NAND flash memory case study. They looked at capacity, latency, and throughput.
How to increase density: multi-bit cells, Moore's Law.
Using these to predict future density: 1.6 TB in 2024, at best?

Latency: SLC-1, MLC-2, TLC-3; higher capacity means larger latency!
So latency is likely to increase in the future (3 ms for 1.6 TB TLC-3?)

Throughput: for fixed capacity, throughput for TLC-3/MLC-2 is far worse than SLC-1 (0.7x)
IOPS: 0.4x (32K; for HDD it's 0.2K)

Conclusion: not so much better compared to HDD (in some cases!)

Q&A:

Q: The future doesn't seem so bleak?
A: SSDs don't just "get better": there are tradeoffs instead of straightforward improvement.

Q: Power characteristics?
A: Didn't study them.

Q: Lifetime for SLC-1, MLC-2, and TLC-3?
A: Drops from 10,000 to 500!


When Poll Is Better than Interrupt
Jisoo Yang, Dave B. Minturn, and Frank Hady, Intel Corporation
My takeaway: well, everybody knows polling is better when operations are fast... but they discuss in detail how the asynchronous path compares.

NVM, and future SSDs made of NVM: fast, enough to use up PCIe bus bandwidth.

Traditional approach (asynchronous model):
An I/O request is submitted to the device, and the SSD interrupts on I/O completion. (The CPU is free while doing I/O.)

Synchronous model:
Bypass the kernel block I/O layer, send the request directly to the device, and poll. (The CPU busy-polls while doing I/O; only beneficial when the device is fast.)

Prototype: NVM Express interface (really fast! 4 us per 4K)

Measurements show that the synchronous model is faster!!!

Further issues with async I/O:
1. Device underutilized when IOPS-pressed (why??????)
2. Interrupt overhead: can be reduced by coalescing, but that increases latency
3. Negative effects on cache and TLB (thrashing)

Implications:
Non-blocking I/O becomes useless
Rethink I/O buffering (esp. I/O prefetching) why????
(A back-of-the-envelope comparison of the two models is below.)
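
A back-of-the-envelope comparison of the two models (all the numbers below are my own assumptions, not measurements from the talk): when the device finishes in a few microseconds, the cost of blocking and being woken by an interrupt is comparable to the I/O itself, so busy-polling gives lower completion latency.

device_latency_us = 4.0      # assumed: ~4 us per 4 KB, as claimed for the NVMe prototype
ctx_switch_us     = 2.0      # assumed cost of one context switch
interrupt_us      = 1.0      # assumed interrupt-handling cost

sync_poll  = device_latency_us                                  # CPU spins the whole time
async_intr = device_latency_us + 2 * ctx_switch_us + interrupt_us  # block, interrupt, wake

print(f"polling:   {sync_poll:.1f} us per I/O")
print(f"interrupt: {async_intr:.1f} us per I/O")
# With a millisecond-scale disk, the same overheads are negligible and the
# asynchronous model wins back the CPU time instead.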
Q&A:
Q: Multi-threading implications?
A: Dedicated polling loop in the current implementation.

Q: What if the request is long? CPU polling for 5-10 ms???

Q: ????

Q: According to the last talk, are we going to get the latency you are assuming????
A: The last talk is about NAND, not the same thing?

Q: Even with polling, OS overhead is big (50%). Should we bypass the OS completely, say, doing I/O in user space or with a GPU?
A: Maintaining the current interface is nice.

Q: Make use of concurrency, one thread doing the polling, to get the potential benefit?
A: Depends on app logic. And blah blah blah….

Q: Overhead breakdown? (Context switch time? You are using make_request instead of the request function the kernel provides!)
A: Refer to other papers….