Friday, October 11, 2013

RAMCloud and the Low-Latency Datacenter

John Ousterhout, Stanford

Low-latency Datacenters in General:
     Datacenter: large compute power + large amount of storage + fast communication; homogeneous, centrally controlled
     How many datacenters? $1K-10K per person for IT infrastructure?
     Innovations in datacenters:
          Phase 1: manage scale (map-reduce)
          Phase 2: low latency (from 1-2 ms down to 2-10 us)
               In current datacenters, an application is separated from its data (unlike on a single machine)

RAMCloud (Storage system for low latency datacenter)
Goal:
1.       All data in DRAM (no cache misses)
2.       Durable and available
3.       Large scale
4.       Low latency (5-10us remote access) for small chunks of data

Architecture:
                App + network + storage servers + coordinator
                Storage servers: master + backup. The master manages memory and data requests; the backup manages secondary storage and durability, talking to other masters for availability, etc.
               
Operations:
                Read, write, delete, cwrite (conditional write based on version number)
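The cwrite operation is essentially optimistic concurrency control on a per-object version number. A minimal sketch of the idea (my own illustration; RAMCloud's real API and names differ):

```python
# Toy sketch of a versioned store with a conditional write (cwrite).
# All names here are my own, not RAMCloud's actual interface.
class ObjectStore:
    def __init__(self):
        self.data = {}                      # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def write(self, key, value):
        _, version = self.data.get(key, (None, 0))
        self.data[key] = (value, version + 1)
        return version + 1

    def cwrite(self, key, value, expected_version):
        # succeed only if no one else wrote since the caller's read
        _, version = self.data.get(key, (None, 0))
        if version != expected_version:
            return False                    # lost the race; re-read and retry
        self.data[key] = (value, version + 1)
        return True
```

A client reads an object, computes, then issues cwrite with the version it read; a False return means another writer won the race.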

Data Durability:
                Eventual durability; log-structured approach (with a small amount of non-volatile memory).
                Crash recovery: log replaying at large scale (?)
Log-structured memory:
            Don't use malloc (it wastes at least 50% of memory as workloads change); instead, structure memory as a log (80-90% memory utilization)
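The log-structured memory idea can be sketched like this (a toy illustration with my own names, not RAMCloud's allocator): appends go to the end of a log, overwrites leave dead bytes behind, and a cleaner copies live objects forward to reclaim space:

```python
# Toy log-structured memory: append-only writes, with a cleaner that
# compacts live data. Utilization stays high because dead space is
# reclaimed explicitly rather than fragmenting a malloc heap.
class Log:
    def __init__(self):
        self.entries = []        # (key, value) in append order
        self.live = {}           # key -> index of newest entry
        self.dead_bytes = 0
        self.total_bytes = 0

    def append(self, key, value):
        if key in self.live:     # the older version becomes dead space
            old_idx = self.live[key]
            self.dead_bytes += len(self.entries[old_idx][1])
        self.entries.append((key, value))
        self.live[key] = len(self.entries) - 1
        self.total_bytes += len(value)

    def utilization(self):
        if self.total_bytes == 0:
            return 1.0
        return 1 - self.dead_bytes / self.total_bytes

    def clean(self):
        # copy live entries into a fresh log, dropping dead space
        survivors = [(k, self.entries[i][1]) for k, i in self.live.items()]
        self.entries, self.live = [], {}
        self.dead_bytes = self.total_bytes = 0
        for k, v in survivors:
            self.append(k, v)
```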
New Projects:
                Higher-level data model (for databases -- say, secondary indexes?)
                System architecture for RPC
               
Threats to Latency:
                Layering (how to organize the software structure while still keeping latency low?)

                Network buffering (esp. TCP just tries to find empty buffers!)

Monday, October 7, 2013

Differentiated Storage Services

Intel Labs
SOSP 2011

Key Idea:
1. Classify I/O requests. Classifiers are preserved across layers: applications, file systems, and block storage systems.
2. Division of labor for I/O performance optimization:
     File systems or applications: develop classification schemes based on their own semantics (how to classify, and a policy for each class)
     Operating systems: preserve classifiers when passing/translating I/O requests.
     Block storage systems: implement the policies using their internal mechanisms.

Example:
    The file system classifies based on block type and assigns a priority to each block class
    The storage system implements the priority policies by giving cache preference to higher-priority blocks (thus no priority inversion)
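A toy version of this class-aware cache policy (my own sketch, not the paper's code): each cached block carries the priority its class was assigned, and eviction always targets the lowest-priority class first, so high-priority blocks are never displaced by low-priority ones:

```python
# Sketch of DSS-style class-aware caching: the file system tags each
# block with a class priority (0 = highest), and the cache evicts from
# the worst class first, avoiding priority inversion.
from collections import OrderedDict

class ClassAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}                 # block -> priority (0 = highest)
        self.lru = OrderedDict()         # block -> None, in LRU order

    def insert(self, block, priority):
        if block not in self.blocks and len(self.blocks) >= self.capacity:
            self._evict(priority)
        if len(self.blocks) < self.capacity or block in self.blocks:
            self.blocks[block] = priority
            self.lru.pop(block, None)
            self.lru[block] = None       # most recently used

    def _evict(self, incoming_priority):
        # evict the LRU block of the worst (numerically largest) priority
        worst = max(self.blocks.values())
        if worst < incoming_priority:
            return                       # incoming block is least important
        for block in self.lru:
            if self.blocks[block] == worst:
                del self.blocks[block]
                del self.lru[block]
                return
```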

Comparison of DSS and IOFlow:
     1. IOFlow gives differential treatment to VM-level I/O flows, while DSS gives differential treatment to each I/O request. (Different granularity)

Reliability/Corruption model and some key-value store building stuff


Zettabyte Reliability with Flexible End-to-end Data Integrity

Data corruption can go undetected, thus high-level (end-to-end) integrity is needed
Checksums are used (a strong checksum is needed)
Drawbacks:
         Performance is bad (need to compute checksums) -- addressed by changing the checksum online
         Detection comes too late (we want to detect before data goes to durable storage!) --- solved by letting every component know about the checksum

         They use a corruption-and-checksum model to estimate the undetected-corruption probability (for a b-bit block)
        
Zettabyte reliability?????
         Less than 3.46×10^-18 per 4KB block (a score of 17.5)
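If the "17.5 score" is read as the negative base-10 logarithm of the per-block undetected-corruption probability (my interpretation of the numbers above, not the paper's stated definition), the arithmetic checks out:

```python
import math

# Interpreting the reliability "score" as -log10(P_undetected)
# for a 4KB block: the quoted threshold gives roughly 17.5.
p_undetected = 3.46e-18
score = -math.log10(p_undetected)
print(round(score, 1))   # 17.5
```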

Improving disk reliability doesn't improve overall reliability that much (maybe because the disk corruption probability is already small???)


How about adding compute overhead as a parameter?


PUE (power usage effectiveness): from the factory to the data center; overhead goes to transformers, etc.
2005: 2-3; 2012: ~1.1

Proportionality: when idle, use no power; when computing at full load, use full power

Energy efficiency: 1GHz sweet spot

Increasing speed costs twice:
1.       Once for switching speed
2.       Once for the memory wall (caching, prefetching, out-of-order execution)

So, a sweet-spot configuration (wimpy nodes):
1.6GHz dual core
32-160GB flash SSD
Only 1GB RAM

Design a key-value store from the very bottom (hardware) up
Fast front-end cache (cuckoo hashing?)
Backend: log-structured data store + hash-table index (instead of using a file system)
Partial-key caching (the complete key is stored along with the data to handle collisions) to enable memory efficiency
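Partial-key indexing can be sketched like this (a SILT-style toy; the names and the 16-bit fragment width are my own choices): the in-memory index stores only a short hash fragment per key, and reads verify against the full key stored next to the data on flash, since fragments can collide:

```python
# Sketch of partial-key indexing: the index keeps a 16-bit key fragment
# plus a log offset, saving memory; the full key lives in the log next
# to the value and is checked on every read to resolve collisions.
import hashlib

def fragment(key, bits=16):
    h = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(h[:2], "big") & ((1 << bits) - 1)

class PartialKeyStore:
    def __init__(self):
        self.log = []                    # "flash": (full_key, value)
        self.index = {}                  # fragment -> list of log offsets

    def put(self, key, value):
        self.log.append((key, value))
        self.index.setdefault(fragment(key), []).append(len(self.log) - 1)

    def get(self, key):
        # a fragment match may be a collision, so verify the full key;
        # scan newest-first so updated keys return the latest value
        for off in reversed(self.index.get(fragment(key), [])):
            full_key, value = self.log[off]
            if full_key == key:
                return value
        return None
```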

Then hardware changed (CPU 6x, memory 8x, SSD 30-60x) --- so CPU and memory have to keep up
How do you minimize memory per entry???
Static external dictionary theory problem
EPH -- 3.8 bits/entry
Entropy-coded tries: 2.5 bits/entry

Think of it as a pipeline!! (how??????)
They shadow writes, then batch-copy them into the main index (?)
So they can only support up to 10% puts

Problem: the Linux kernel I/O stack overhead is too high
Load balancing: add a cache at the front end to deal with hot spots
Proof: only n*log(n) cache entries are needed for n loads to achieve almost-perfect load balancing
So only the L3 cache is needed for caching!
Intuition: (didn't understand)

Their solution can't deal with hash hacking (it assumes the hash function is invisible to clients)


A tradeoff: more reads to avoid some writes?
Is the bottleneck always in I/O? Also, flash performance bursts

They want to manage raw flash and treat it as many sequential-write devices (???????)
Now they have a SCSI command to exchange (remap) mappings on the SSD, and they can do cool things using that





Korean Android Trace/Measurement

They built an FTL simulation tool, and also a back-mapping from block number to inode number to file path.

Using that, they were able to measure:

1. Write size (sectors) and write-request distributions by file type and application.
    For almost every application:
    -- ext4 journal writes are the most
    -- then Android and database temporary files, I think

2. Average overwrites on a block, by file type and application
   For each application:
     -- many overwrites of the same block for ext4 journal files and database/database-journal files

3. Average write size per request, by file type and application
    For each application:
     -- 1.x blocks per write request for almost all kinds of writes
     -- 6-7 blocks per write for the ext4 journal

4. Traffic source --- how much data is written for each type of file (per application)
     -- for every 1KB of user data, it seems ~3KB of Android & kernel temporary data is written, plus 2KB of FTL garbage-collection and other FTL writes

The science of crash consistency

barriers are not used
but we usually get away with it because the journal is written sequentially

consistency is not a binary thing -- probabilistic crash consistency
What's the probability of inconsistency upon a crash?

P = time in (inconsistency) window / total time
(They assume/observe that for a given storage configuration, not many orderings are observed, so they average these out)
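The P metric is just a time fraction; a trivial sketch, assuming a crash is equally likely at any instant:

```python
# Probabilistic crash consistency: P(inconsistent) is the fraction of
# total run time during which the on-disk state sits in a vulnerability
# window (some re-ordered block is persisted before its dependency).
def p_inconsistent(windows, total_time):
    """windows: list of (start, end) intervals where a crash leaves
    the file system inconsistent."""
    in_window = sum(end - start for start, end in windows)
    return in_window / total_time

# e.g. two 5 ms vulnerability windows over a 100 ms run -> P = 0.1
print(p_inconsistent([(10, 15), (60, 65)], 100))
```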

Different re-ordering:
commit-commit: second transaction before the first transaction
commit-data: data block before transaction commits
commit-metadata: metadata block before transaction commits
checkpoint-commit
checkpoint-data
checkpoint-metadata



IO Priority: Related Work


Labeling I/O requests globally:


Differentiated storage services 
SOSP 2011 M. Mesnier, F. Chen, T. Luo, and J. B. Akers.
Applications label I/O requests. The label then propagates with the request from the file system to the block device using the SCSI protocol.

Diagnosing performance changes by comparing request flows. NSDI 2011
Diagnosing performance changes by comparing request flows SIGMETRICS 2006
Explicit labels for IO requests in distributed system.


QoS for Storage Systems:

A Fresh Approach to File System Quality of Service NOSSDAV 97
A Practical Learning Approach for Dynamic Storage Bandwidth Allocation IWQoS2003
Providing QoS guarantees for disk I/O. Multimedia Systems, 57-68, Feb 2000
Travelling to Rome: QoS specification for automated storage system management. IWQoS 2001


Multi-tenant storage control:


Proportional allocations of resources for distributed storage accesses. FAST 2009
Proportional share scheduling for distributed storage systems. FAST 2007

Performance Insulation for Shared Storage Servers. FAST 2007

Performance Differentiation for storage systems using adaptive control. ACM Transactions on Storage

Performance Isolation and Fairness for multi-tenant cloud storage
David Shue, Michael J. Freedman, and Anees Shaikh.
OSDI, 2012
   
System level max-min fairness at the key-value store level.
Assume well-provisioned network, do not deal with in-network resource sharing.  
They mainly deal with: where to place datasets, where to replicate datasets, how to allocate weights to each node, and how to do fair queuing locally to ensure DRF even though some datasets are more (dynamically) popular than others. (They use DRF, so they do deal with multiple resources, namely bytes received, bytes sent, and requests; bytes are bounded by network/storage, requests by interrupts???)

IOFlow: A Software-Defined Storage Architecture

from Microsoft Research and Microsoft
SOSP 2013

This is the closest architecture to what I imagine Software-Defined Storage should be like.
They borrowed the whole set of ideas from SDN, especially the separation and programmability of the control plane.

Key Idea:
1. Define IOFlows as {VMs, operations, files, shares} -- basically source, destination, and type (read/write/create)
2. A language to describe end-to-end policies over different I/O flows (bandwidth, priority, middleboxes, etc.)
3. Define stages as different I/O request processing points (storage drivers, NICs, etc.) and expose a standard queuing abstraction (analogous to switches and OpenFlow in SDN)
4. Centralized controller to realize policies by specifying queuing rules at different stages (analogous to the SDN controller; the distributed state abstraction is also realized here). The rules are updated periodically based on running statistics.

A lot of challenges can then be solved by directly using SDN-based techniques.

Challenges Specific for Storage:
1.  No standard I/O request header (unlike network packet headers)
     Solution: the controller maps flow names to low-level identifiers (files, blocks, etc.)
2.  I/O requests, unlike network packets, have different types (read/write/create), which require different treatment
     Solution: release tokens based on the end-to-end cost of operations. This cost is estimated by running benchmarks to get the cost of an I/O request as a function of its type and size.
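The cost-based token release can be sketched as a token bucket whose per-request charge comes from a cost model of operation type and size. This is my illustration of the idea, not IOFlow's implementation, and the cost function here is entirely made up:

```python
# Sketch of cost-based rate control: the controller sets a token rate
# per flow, and each request consumes tokens equal to its estimated
# end-to-end cost (in IOFlow, benchmark-derived; faked linearly here).
def cost(op, size_kb):
    # hypothetical calibration: writes cost more than reads,
    # and cost grows with request size
    base = {"read": 1.0, "write": 2.0, "create": 3.0}[op]
    return base + 0.1 * size_kb

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst

    def tick(self, dt):
        # refill tokens at the controller-assigned rate
        self.tokens = min(self.burst, self.tokens + self.rate * dt)

    def admit(self, op, size_kb):
        c = cost(op, size_kb)
        if self.tokens >= c:
            self.tokens -= c
            return True
        return False                 # queue the request until tokens refill
```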

Unsolved Problems:
1. In order to realize some policies, one needs to enforce rules at EVERY I/O processing point (e.g., HOL blocking for priority).
    For example, file systems (journals), I/O schedulers, network switches, distributed file systems, etc. How to enforce policies at these places, and how do they work together? They didn't address this.
    (The unsupported-layer problem could be alleviated by controlling the number of outstanding I/Os in those layers, though)
2. Didn't consider the impact of the network.
    They assume a good-enough network (40Gb/s) and didn't consider the possibly complicated storage/network interactions.
3. Does not consider the location information of I/O requests.
4. Needs to operate at VM granularity; no finer control.
5. This kind of queuing model doesn't work well with caching mechanisms.