Friday, December 13, 2013

Mahesh Balakrishnan's recent work on SSD-cache

He works at Microsoft Research and has done a lot of work that uses SSDs as a cache...

Gecko: Contention-Oblivious Disk Arrays for Cloud Storage
FAST'13

Motivation:
Multiple VMs issue I/O to the underlying storage, which mixes up sequential and random accesses (or interleaves different sequential streams), so the disk sees only random accesses and I/O performance degrades severely.

Inadequate solutions:
Re-ordering I/O: increases latency, incomplete
Workload placement: requires prediction, which may be inaccurate; not flexible
LFS: garbage collection kills performance

Observation:
A single sequentially accessed disk is better than multiple randomly seeking disks.

Solution:
Instead of striping, they chain multiple disks into one circular log. Log-structured writes always happen at the tail; garbage collection happens at the head, which separates the random and sequential access streams. Reads can still hit all the disks, though, so they add an SSD cache to alleviate the read problem and to speed up writes. Metadata, such as the reverse index, is also kept on the SSD.
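A minimal sketch of the chained-log idea (my own illustration, not Gecko's code): the disks are concatenated into one circular log, new writes always append at the tail, and cleaning reads proceed from the head, so the two streams mostly land on different disks.

```python
# Hypothetical sketch of Gecko-style disk chaining (illustrative only).
class ChainedLog:
    def __init__(self, num_disks, blocks_per_disk):
        self.blocks_per_disk = blocks_per_disk
        self.capacity = num_disks * blocks_per_disk
        self.tail = 0          # next physical block to write
        self.head = 0          # oldest live block (garbage-collection frontier)

    def _locate(self, physical):
        # Map a physical log offset to (disk id, offset on that disk).
        physical %= self.capacity
        return physical // self.blocks_per_disk, physical % self.blocks_per_disk

    def append(self, logical_block):
        # All new writes go to the tail disk -> purely sequential on one disk.
        disk, off = self._locate(self.tail)
        self.tail += 1
        return {"logical": logical_block, "disk": disk, "offset": off}

    def gc_one(self):
        # GC reads from the head disk, which is (mostly) not the tail disk,
        # so cleaning I/O does not disturb the sequential write stream.
        disk, off = self._locate(self.head)
        self.head += 1
        return {"gc_read_disk": disk, "offset": off}

log = ChainedLog(num_disks=3, blocks_per_disk=4)
print([log.append(i) for i in range(5)])   # writes fill disk 0, then disk 1
print(log.gc_one())                        # cleaning starts back at disk 0
```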

Performance:
Performance is consistent at ~100 MB/s. Traditional striping usually gets 200 MB/s, but during garbage collection (about 30%-50% of the time) it drops below 50 MB/s.

My comments:
Another way to solve this problem of multiple I/O workloads getting mingled together is to have the VMs collaborate with each other. VMware's mClock (OSDI'10) and Parda (FAST'09) have a bit of this flavor, but they focus on fair sharing instead of sequentiality optimization. Might be an interesting problem to look at...



Tango: Distributed Data Structures over a Shared Log
Microsoft Research and Microsoft
Presented by Vijay

This is built on the same principles as a log-structured file system, but distributed. It is built on SSD arrays, so random reads are fast (necessary for a log structure).

They keep a global sequence number to achieve serializability. However, writes from different clients with different sequence numbers land at different locations in the log, so they can happen concurrently.
Transactions are implemented in an optimistic fashion:
A transaction starts with a transaction-begin block in the shared log, followed by a read block that records the version numbers of all the objects this transaction is going to read (and thus depends on), followed by whatever updates the transaction made. Every client can then decide locally whether the transaction commits: it commits if the versions of the objects in the read block haven't changed, with no coordination required. If an object in the read block has been updated, the transaction simply aborts and restarts.
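A toy sketch of that optimistic commit decision (my own reconstruction, not Tango's API): every client replays the shared log, and a transaction's writes are applied only if none of the objects in its read block have changed since the recorded versions.

```python
# Illustrative sketch of Tango-style optimistic commit, assuming a simple
# in-memory "shared log" of transaction records.
current_version = {}   # object id -> version number seen so far during replay
state = {}             # object id -> value

def replay(log):
    for tx in log:
        # Commit check: every object in the read set must still be at the
        # version the transaction observed; otherwise the transaction aborts.
        ok = all(current_version.get(obj, 0) == ver
                 for obj, ver in tx["read_set"].items())
        if ok:
            for obj, value in tx["writes"].items():
                state[obj] = value
                current_version[obj] = current_version.get(obj, 0) + 1
        # Crucially, every client reaches the same commit/abort decision
        # just by reading the log -- no extra coordination is needed.

log = [
    {"read_set": {"x": 0}, "writes": {"x": "A"}},   # commits, x -> version 1
    {"read_set": {"x": 0}, "writes": {"x": "B"}},   # aborts, x already changed
]
replay(log)
print(state)   # {'x': 'A'}
```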

They claim that garbage collection is not a big problem because on flash cards you can do thousands of operations per second even with garbage collection going on.

This makes it easier to build distributed data services. They built ZooKeeper in ~1000 lines of code on top of this. It is also likely to have higher performance: in a traditional distributed consistency protocol, communication overhead is high with many participating nodes, so the good performance provided by SSDs may not be fully utilized.

Sunday, December 1, 2013

Bounded gaps between primes

Here is a nice summary of Yitang Zhang's talk:

By now the bound has been pushed down to 16. However, it is hard, if not impossible, to extend Zhang's method to prove the twin prime conjecture, due to the parity problem in sieve theory.

Wednesday, November 27, 2013

Some Linux tricks

1. Error message "Invalid output format udev. Choose from value, > device, list, or full >" when installing a kernel
    This happens because the system uses an old version of blkid, which doesn't recognize the newer udev format.
    To solve it, just install a new util-linux package. Sometimes yum screws up and has the latest util-linux while blkid is still not updated; in that case just reinstall: yum reinstall util-linux

Monday, November 18, 2013

MIGRATORY COMPRESSION

EMC

Coarse-grained Data Reordering to Improve Compressibility

Compression is limited by its window size; a larger window leads to slower compression.

MC: first do a coarse-grained reorganization to group similar blocks together, so that compressibility improves with the same window size.

Gzip: Huffman coding, 64KB window
Bzip2: 900KB window, Burrows-Wheeler Transform and Huffman coding
7z: Markov chain-based range codec, 1GB window
(However, this is not an apples-to-apples comparison, as they don't know how to increase the window size of Gzip etc...)

There is a roughly inverse relation between throughput and compression factor.

Idea: improve compression by grouping similar blocks

Burrows-Wheeler: reorders within a block
MC: reorders the blocks themselves

Usecase:
1. compress a single, large file
2. Archival tier of deduplicating storage system
3. network transfers (maybe?)

Delta compression: to store A', just store how to transform A into A'. OK for a single file, too complex for a deduplicating storage system, and not good for locality either.

Workflow of MC:
To compress:
segmentation (partition into blocks, calculate super-fingerprints) ---> similarity detector ---> re-organizer
To uncompress:
a header specifies how to reconstruct the file

Similarity detection: similar to "redundancy elimination at the block level" -- calculate features using a set of rolling hashes.
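A rough sketch of what such feature-based similarity detection can look like (hypothetical and MinHash-flavored; the paper's exact features and super-fingerprints differ):

```python
# Illustrative similarity features: for each of a few hash "seeds", keep the
# maximum hash over all small windows of the block; blocks sharing most
# features are treated as similar and can be grouped before compression.
def block_features(data: bytes, window=8, num_features=4):
    features = []
    for seed in range(num_features):
        best = -1
        for i in range(len(data) - window + 1):
            best = max(best, hash((seed, data[i:i + window])))
        features.append(best)
    return tuple(features)

def similar(a: bytes, b: bytes, threshold=2):
    fa, fb = block_features(a), block_features(b)
    return sum(x == y for x, y in zip(fa, fb)) >= threshold

blk1 = b"the quick brown fox jumps over the lazy dog" * 10
blk2 = blk1[:-3] + b"cat"          # nearly identical block
print(similar(blk1, blk2))         # likely True: most features match
```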

Reorganization: reorganize the input file according to a recipe
1. in-memory, read in entire file
2. chunk-level: random access
3. multi-pass: in general better than chunk level

Evaluation:
At least for an in-memory system, adding MC before compression never does worse than vanilla compression. If things don't fit in memory, the gains are smaller because of the random accesses incurred. SSDs help, though. --> Remzi: can you just limit the re-ordering scope so that it fits into memory?

Decompression has 30% to 70% overhead in an archival system, though, due to random access.


Q: How about workload-specific compression? If you know the workload better, you may be able to compress better, whereas MC is a pretty generic technique?
A: I need to think about it.
     But NetApp's video compression falls into that category...

Friday, November 15, 2013

SyNS'13 Session 6: Security

Resource-freeing Attacks: Improve Your Cloud Performance (at Your Neighbor's Expense)
Venkat Varadarajan, Mike Swift, Tom Ristenpart

Side channel stuff, take advantage of cpu scheduling etc.
Solution: better isolation; pack up the VM and move it (but not all workloads are cheap to move).

This work: a greedy customer can get more resources via a resource-freeing attack

Victim:
Beneficiary:
Helper: mounts the attack, could sit in beneficiary

Example:
Cache contention: the victim webserver interrupts frequently and pollutes the cache used by the beneficiary (because Xen gives higher priority to a VM consuming less CPU time). The helper sends CPU-intensive requests to the webserver, which reduces the interrupts the webserver generates, and the beneficiary's performance improves.

General Resource-freeing attacks:
1. Send targeted requests to the victim
2. Shift resource usage away from the bottleneck

OpenNF: All your network functions are belong to us (almost)
Aaron Gember, Aditya Akella

NFs (or middleboxes) examine/modify packets at layers 3-7, and software NFs are increasingly being deployed.

But today NFs are treated as black boxes, which makes it hard to identify bottlenecks etc.

Example:
If you scale up, you need to move flows, otherwise the bottleneck may persist or the scale-up is delayed. But simply re-routing flows may lead to inconsistency, so move flows along with the internal state of the middleboxes.

Solution: OpenNF
APIs implemented by NFs so that applications can examine/update a middlebox's internal state.
Problem with VM replication: unneeded state may cause incorrect actions, and states cannot be merged, making complex load balancing impossible.

NF state taxonomy:
Per-flow state, multi-flow state, all-flow state (statistics)

Per-flow state we can just move around; multi-flow state we share (but synchronization is expensive, so clone and lazily merge only when scaling down); all-flow state is handled the same way as multi-flow state.

State consistency: (because the network could update states while we are moving things)
Suspending flow traffic and then moving is one solution.
Packet redirect events are a more efficient solution. (It is critical that you move the states before you update the network routing.)


RiskRoute: A Framework for Mitigating Network Outage Threats
Ramakrishnan Durairajan, Paul Barford

Adjust internet routing in real time, before outages (caused by natural disasters) happen.

Does internet routing currently take advantage of the predictability of natural disasters?
Yes, but today it is done by hand, and is thus incomplete and inefficient.

Bit-Risk Miles metric:
Assesses sensitivity to network outages. Defined as
                     Bit-Risk Miles = (# of bits sent) x (distance) x (outage risk)

RiskRoute methodology:
An optimization problem which minimizes the bit-risk miles.

Evaluation:
1. Risk ratio: average reduction in bit-risk miles
2. Distance ratio: average increase in bit-miles.





SyNS'13 Session 5: Bugs

Toward Efficient, Portable Application-Level Crash Consistency
Thanu, Remzi

Many techniques for file system consistency (for metadata), but how about application data consistency (for user data)?

How do applications achieve consistency? They rely on specific details of file system implementations.

Application-level invariants: thumbnails match pictures, etc. But hard to maintain over system crashes.

Example:
Atomic file update: create a temp file, fsync the temp file to disk, then use rename(). The fsync() may be left out because of a wrong understanding, for performance's sake, or because on most file systems (ext4), it is actually correct.
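For concreteness, a minimal sketch of that atomic-update protocol using standard POSIX calls (the final directory fsync is the step applications most often skip or get away without):

```python
import os

def atomic_update(path, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # 1. make the temp file's contents durable
    os.rename(tmp, path)              # 2. atomically switch names
    dirfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
    try:
        os.fsync(dirfd)               # 3. make the rename itself durable
    finally:
        os.close(dirfd)

atomic_update("config.json", b'{"version": 2}')
```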

Want to study application consistency behavior.
Methodology: case studies of SQLite and LevelDB.

Properties offered by file systems: (inspired by the disk properties SQLite relies on, I think)
1. Post-Crash property (true/false): does a system call sequence only result in a given, desirable set of post-crash states? ext3-ordered: yes, ext3-writeback: yes, ext4-ordered: no, btrfs: yes
2. Ordered Appends (true, false): ext3-ordered: yes, ext3-writeback: no, ext4-ordered: no, btrfs: no
3. Ordered dir-ops: directory operations get persisted in issued order. 
4. Safe-appends: when a file is appended, the appended portion will never contain garbage
5. Safe new file: after fsync() on a new file, another fsync() on the parent directory is not necessary.

Bugs in SQLite and LevelDB: all of them are only exposed on certain file systems. And the developers said that they don't understand file system behavior, so they had to be conservative in their implementation, which hurts performance.

Performance:
Experiments show a 3x performance boost if you can rely on certain file system properties to optimize application behavior!

Future work:
Solutions which don't require application re-writing -- system call re-ordering, maybe??? Tools to detect such bugs?


Efficient Concurrency-Bug Detection Across Inputs
Dongdong Deng, Shan Lu

Concurrency-bug detection is costly: you need to test many inputs, and each input has a huge interleaving space to test. Software companies cannot afford exhaustive in-house testing.

Idea:
Remove redundancy across inputs. Multiple inputs could trigger the same bug -- how can we remove such overlaps?

Solution:
Given a set of inputs, find the overlapping interleaving space between different inputs.
Which two inputs will give the same interleaving space? We cannot answer this question perfectly, but we can estimate it. To this end, the concurrent function pairs (CFP) metric is proposed.

Characteristic study:
For each harmful data race, on average 7 inputs trigger it; for each benign data race, on average 4 inputs trigger it.

Approach:
1. Profile the interleaving space of each input: look at functions instead of instructions, as only a few functions share memory accesses; thus look at all the function pairs which could be executed concurrently. If two inputs have similar concurrent function pairs, they are likely to trigger the same bugs.
    Naive way to detect CFPs: look at every instruction in each function and see if they can run concurrently.
    Efficient way: check whether one function's entry point can execute between the other's entry and exit, by looking at locks and barriers.
2. A test plan of selecting inputs and functions.
    Based on the CFP information, use a greedy algorithm to select the inputs which cover the most CFPs (see the sketch after this list).
    Also, select functions which we test. 
3. Conduct bug detection.
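A small sketch of the greedy selection in step 2 (my own rendering of the idea, assuming we have already profiled each input's set of concurrent function pairs):

```python
# Hypothetical greedy input selection based on concurrent function pairs (CFPs).
def select_inputs(cfp_sets, budget=None):
    """cfp_sets: dict mapping input name -> set of CFPs it can exercise."""
    remaining = set().union(*cfp_sets.values())
    chosen = []
    while remaining and (budget is None or len(chosen) < budget):
        # Pick the input covering the most still-uncovered pairs.
        best = max(cfp_sets, key=lambda i: len(cfp_sets[i] & remaining))
        gain = cfp_sets[best] & remaining
        if not gain:
            break
        chosen.append(best)
        remaining -= gain
    return chosen

cfps = {
    "input1": {("foo", "bar"), ("foo", "baz")},
    "input2": {("foo", "bar")},                  # redundant with input1
    "input3": {("qux", "quux")},
}
print(select_inputs(cfps))   # ['input1', 'input3'] -- input2 adds nothing new
```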

Result:
4x redundancy reduction, and a corresponding speedup.


Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems
Thanh Do, Haryadi Gunawi

Hardware fails in different ways; what about performance degradation as a failure manifestation?
Facebook reported that one slow NIC in one machine (degraded from 1 Gbps to 1 kbps) had a cascading impact on a 100-node cluster. And there are many other such stories.

Limpware: hardware whose performance is much lower than its specifications.
Study the impact of limpware: (a single piece of limpware can have global impact -- why?)

Methodology:
Run workload, inject limpware, then probe systems to understand symptoms.

Results:
1. Failures are handled pretty well (by retries etc.), but slowdowns are not!
2. Hadoop's speculative execution is not triggered. (E.g., a degraded NIC on a map node: all reducers need to fetch data from that node, and speculative execution is not triggered because all reducers are slow! In general, a degraded task leads to a degraded node, which then leads to many degraded nodes.)

Limplock:
 The system progresses slowly due to limpware and is not capable of failing over to healthy components.
Level 1: Operation limplock
              caused by single point of failure etc.
Level 2: Node limplock
              caused by multi-purpose queue, unbounded queue, etc
Level 3: Cluster limplock
              
Limplock happens in all the systems they analyzed.

Future work: Limpware-tolerant system
   1. Anticipation
   2. Detection
   3. Recovery

SyNS'13 Session 4: Measure it!

Inaccurate Spectrum Databases?
Tan Zhang, Suman

TV whitespaces are available -- how do we use them?

Things people are already doing:
1. Rural internet connection
2. Smart power-grid controller
3. Bus internet

But it shouldn't interfere with primary TV users, hence the need to determine where the whitespaces are.

Wardriving on public vehicles (spectrum sensors deployed on Metro buses)

Accuracy of Commercial Databases:
1. Good for protecting used spaces
2. But too conservative, so they waste white spaces.

V-Scope: opportunistic measurement.
Wireless gateways upload data to V-Scope servers, and the server then uses the internet to distribute information to the participating databases.

Challenges in measuring the primary signal:
Weak signals could be overwhelmed by strong signals ---> zoom-in peak detection (narrowing the capture bandwidth to reduce the noise floor).
Power ---> peak-based power estimation.
Model refinement ---> improve any propagation model by fitting its parameters with measurements ---> weighted regression fitting to allow good fits even for weak signals.


Enhancing the Quality of Electric Load Forecasting Methods
Scott Alfeld, Paul Barford

Predicting electricity: components, buildings, cities, countries; days, years.

This work looks at buildings and cities, not single homes.

Problem:
Many, many buildings make up cities
Easily available: Historical electricity usage
Less so: Everything else (what if you have a new building, a new event?)

Data: hourly electricity usage for 2-3 years, NAICS code ("business type")
Task: Predict tomorrow's (hourly) load

Solution:
Latent Feature Extraction
Anomaly Detection
Priority Encoding

Latent Feature Extraction:
Solid, but unknown features: number of floors, surface area
Fuzzy and unknown features: sensitivity to humidity, behavior during holidays, ??? (something we didn't think of)

Anomaly Detection:
Individual Anomalies (for one building): boiler upgrade, 50%-off sale
Shared Anomalies (for multiple buildings): three-day weekends, snowstorms

Priority Encoding:
Accuracy may not be the most important thing. The peak is much more important than total usage.
Peak prediction and discontinuous penalties: MSE = (1/n) * sum_i (t_i - y_i)^2 doesn't work, because errors at the peak matter more for your prediction than errors made during off-peak hours; you also don't want false-positive peaks.
Challenges: you need to align the predicted peak with the actual peak, but not produce a straight line at the peak value! The metric also needs to be easy to optimize.
We don't have a solution for that yet....


Observing Home Wireless Experience through WIFI APs
Ashish Patro, Suman

A measurement infrastructure which can capture the "wireless experience" in home WLANs.
Questions asked:
1. What is the wireless performance at any given moment?
2. How often does poor performance occur?
3. What are the causes of poor performance?

WiSe infrastructure (wireless infrastructure for in-line sensing)
Data collected:
signal strength of devices, air time utilization, latencies, WiFi packet statistics.
Time correlation analysis
Characterizing wireless performance:
   metric: only passive and coarse local measurements; captures the impact of links; provides app analytics
    Input: air time utilization, total contention, ????

Results:
Cause of poor performance
in building 1, when performance is poor, 60% of the time we see high air time utilization and high packet loss, indicating channel congestion
in building 2, when performance is poor, latency is high and packet loss is high, indicating a weak signal source

Interference from microwave ovens:
Interference is short, but present in most buildings 

New Encryption Primitives for Uncertain Times

Encryption is one of the few things that you can rely on.

But existing encryption primitives aren't always helping:
DPI can block protocols (like Tor) --> encryption tools are easy to detect --> format-transforming encryption
Dropbox has access to your data --> encryption doesn't allow de-duplication to save space --> message-locked encryption
Password-based encryption often gets bad passwords --> decryption yields garbage for a wrong key but meaningful output for the correct key, which is good for brute-force attacks --> honey encryption

1. Encryption doesn't try to hide its presence
    Steganography: embed bits into HTTP messages; but too slow
    Obfsproxy: encrypts all bits sent over the network, but DPI will flag the traffic as "????"
    Want: trick DPI into thinking it's HTTP traffic
     Can we build encryption which can fool regular-expression matching? ---> Format-transforming encryption: take the clear text and a regular expression R; the encrypted text is guaranteed to match R.
      Mechanism: ranking, as for DFAs, but now for regexes (exponential complexity, but that's OK here)
      Can DPI adapt to detect FTE?
       1. Use R1, R2 against FTE --> false positives
       2. Find an R that matches FTE but not legitimate traffic --> fast to change FTE formats
       3. Find non-regular checks ---> speed?
       4. Custom fight-back? ----> ???

     FTE beats China's blocking of Tor, but the Tor guys are skeptical about deploying an "academic" product (as if they weren't an academic product themselves....)

2. De-duplication doesn't work with conventional client-side encryption
    Prior work (in the distributed systems literature, surprisingly): convergent encryption (CE)
     Their solution:
        M --> KeyGen(M) --> K (which is then used for encryption: Enc(K, M)) --> ?????
     Big idea: the key is derived from the message itself, so identical messages encrypt to identical ciphertexts, which can be deduplicated.
      But it has a brute-force attack problem.
       DupLESS is built to overcome the brute-force attack issue of MLE

   Q: If we store the same data, then I know that you know (that I know...?)
   A: Yes, but this is an orthogonal (side-channel) issue to what we are looking at.


3. Password-based encryption -> people don't pick good passwords, so you need to assume the password is drawn from a small set
  
message + secret key K ---> ciphertext C. A brute-force attack is possible by iterating over K (only one decrypted text is not garbage).
Prior solution: the PKCS#5 standard, which does key derivation with lots of hashing, so the attacker has to do the slow hashing for every single guess; but it is slow for legitimate users too, and it only slows down the previous attack by a constant factor.
Idea: increase the difficulty of picking out the right key by producing plausible output when you decrypt with a wrong key.
Honey encryption: same API as password-based encryption schemes, but uses special encodings to ensure that decrypting the ciphertext with a *wrong* key yields a fresh sample from some target distribution (like a uniform distribution over natural-language messages).
Honey encryption for prime numbers: P (a prime number) + K ---> C. An attacker can just try decrypting with each key and look for prime numbers. Now you want an encoding such that every decryption of C yields a prime number.
           P --> (distribution-transforming encoder) ---> uniform bit-string ---> conventional encryption with K --> C
           Attacker: C --> decrypt with K' ---> fresh uniform bit-string ---> distribution-transforming decoder ---> P'
      No attacker A can recover the message with probability better than 1/q (where q is the size of the set the password is drawn from), no matter how long it runs.
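A toy honey-encryption sketch under strong simplifying assumptions (messages drawn uniformly from a tiny known set, and XOR with a hash-derived pad standing in for conventional encryption; real DTEs and ciphers are far more careful):

```python
# Toy honey encryption over a small message set (illustrative only).
import hashlib, os

MESSAGES = ["attack at dawn", "retreat at dusk", "hold position", "send supplies"]
SEED_BYTES = 4

def _pad(key: str, nonce: bytes) -> int:
    digest = hashlib.sha256(key.encode() + nonce).digest()
    return int.from_bytes(digest[:SEED_BYTES], "big")

def encrypt(msg: str, key: str):
    idx = MESSAGES.index(msg)
    # DTE encode: pick a uniform seed that decodes back to this message.
    while True:
        seed = int.from_bytes(os.urandom(SEED_BYTES), "big")
        if seed % len(MESSAGES) == idx:
            break
    nonce = os.urandom(8)
    return nonce, seed ^ _pad(key, nonce)      # "conventional" encryption: XOR pad

def decrypt(ciphertext, key: str):
    nonce, body = ciphertext
    seed = body ^ _pad(key, nonce)
    return MESSAGES[seed % len(MESSAGES)]       # DTE decode: always plausible

c = encrypt("attack at dawn", "correct horse")
print(decrypt(c, "correct horse"))   # the real message
print(decrypt(c, "wrong guess"))     # still a plausible-looking message
```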

Hopefully next year we will have a general encryption scheme which covers all three of the above (since they are based on similar ideas anyway).

Q: Have you looked at other real-world security problems that you think could use an interesting encryption scheme, but you don't know what that scheme is yet?
A: Yes, all the time. Like Gmail encryption: you lose spam detection, and we don't know how to solve that. In general, if you want your data to be safe but you also want a third party to do something with your data (de-duplication, spam detection), you have a problem.

SyNS'13 session 3: Fast & Safe

How is My Driving: Sensing Driving Behaviors by Android Devices
Lei Kang, Suman

Snapshot: a hardware solution to count driving miles and hard brakes (speed decrease in mph per second). 30% discount for "good drivers".

A no-special-hardware approach:
a rating system based on sensors (accelerometer and gyroscope for acceleration, braking, turns, lane changes)
Keep track of traffic and road conditions.

Challenges:
  1. People put their phone in different locations
      Solution: coordinate projection, to translate the phone's coordinate frame, in whatever rotation, to the earth coordinate frame (see the sketch after this list).

  2. Hard to define good or bad behavior
      Solution: compare the system's rating with passengers' ratings
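A minimal sketch of the coordinate-projection idea from challenge 1 (my own simplification; real systems also need a compass/gyroscope for heading): use the gravity vector measured in the phone frame to split any accelerometer reading into vertical and horizontal components, independent of how the phone is mounted.

```python
import numpy as np

def project_to_earth(accel_phone, gravity_phone):
    """Split a phone-frame acceleration into vertical and horizontal parts,
    using the (low-pass filtered) gravity reading as the vertical reference."""
    g_hat = gravity_phone / np.linalg.norm(gravity_phone)
    vertical = np.dot(accel_phone, g_hat) * g_hat
    horizontal = accel_phone - vertical          # braking/turning live here
    return vertical, horizontal

# Phone lying at an arbitrary tilt: gravity shows up on two axes.
gravity = np.array([0.0, 6.93, 6.93])            # ~9.8 m/s^2 total
accel   = gravity + np.array([2.0, 0.0, 0.0])    # plus a 2 m/s^2 braking event
v, h = project_to_earth(accel, gravity)
print(np.linalg.norm(h))                         # ~2.0, independent of phone pose
```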

Related work:
Driver phone detection using car speakers (MobiCom'11)
Driver phone detection using different centripetal force readings on the accelerometer (MobiSys'13)
Distracted-driver detection using the mobile phone camera (MobiSys'13)

Q: Immediate feedback would be useful?
A: That is realtime detection of driving behavior. What we are doing is offline study, but online study wouldn’t be difficult.


Rinnegan: Efficient Resource Use for Heterogeneous Processors
Sankar, Mike Swift

Task scheduling: a task can run on different processing units (parallel code on GPU or CPU, encryption on CPU or AES unit). It is not always good to use the GPU, because of contention and data-transfer overhead.
Power is limited: not all units can run at full performance at once. We cap power usage pretty well today, but we don't distribute power smartly across different units.

Rinnegan:
 Adaptive task scheduling, power awareness.
1. Adaptivity is achieved by the OpenKernel, which exposes resource utilization and lets applications make task-placement decisions. Don't do it transparently in the kernel, because the application knows best, and accelerators can be accessed directly by apps.
    Implemented by an accelerator stub, which predicts utilization; an accelerator agent, which implements scheduling decisions, enforces power limits, and exposes utilization information; and an accelerator monitor, which publishes the information exposed by agents to applications.
2. Power Abstraction
  Power credits are maintained by a central power center; a power agent calculates the power model and asks for power credits.




SyNS'13 session 2: Storage

Storage Class Memory Needs Flexible Interfaces
Sanketh Nalli, Michael Swift

Different SCM:
1. Phase-Change Memory (PCM) -- reads as fast as memory, writes as fast as flash
2. Spin-transfer torque MRAM
3. Flash-backed DRAM
4. ???

Now it relies on the file system because of:
1. global naming
2. protected sharing (delete open files, etc.)
3. Crash Consistency
Some use direct mapping for data, but go through FS for metadata

But name-space operations matter!

Causes of problems:
1. Trapping into the OS --- everything is a file
2. Fixed interface exposed by the file system

What if POSIX FS for latency app, and PROXY FS for new app?

Solution: Aerie (A user-mode file system which provides flexible interfaces)

SCM Manager: allocation, protection, addressing
LibFS: namespace implementation
Trusted Service: integrity and consistency

Two file systems were implemented: one POSIX-like, one key-value-store-like



Fault Isolation and Quick Recovery in Isolation File Systems
Lanyue Lu, Remzi

A small fault can cause global failure -- all files on a file system share a single fault domain.

Solution:
A new abstraction of fault domains. (Isolation FS)

File system failure behaviour study:
1. What global failures are used?
     Read-only and crash. Over 67% of global failures crash the file system.
2. Causes of global failures?
     Metadata corruption, I/O failures, or software bugs (a null pointer, say). All global failures are caused by system state, not user data.

Current FS abstractions:
File and directory, namespace, partition (a panic still takes down the whole OS, and it is inefficient)

Isolation File Systems:
  File pod: an abstract partition which contains a group of files and their related metadata.
   1. Isolate metadata for each file pod
   2. Convert global failures into local failures.
Implementation on ext3:
   1. A file pod contains multiple block groups. A pod id is inserted into each group descriptor. 
   2. Defragmentation for groups.
   3. Journal isolation: a virtual journal contains updates from only one pod

Q: Is there a case where you want global failures?


Optimistic Crash Consistency
Vijay, Remzi

File systems conflate ordering and durability, which makes consistency expensive.
They provide osync(), which provides ordering without durability; this leads to a high-performance file system that provides consistency as strong as ext4, just with less timeliness.

Techniques:
1. Disk notification when a block hits surface
2. transaction checksum to detect re-ordering in journal blocks
3. Some other techniques

Insights:
More and more layers sit between the app and storage. We need interfaces that provide freedom for each layer but still preserve correctness.



SyNS'13 Session 1: Cloud

Next Stop, the Cloud: Understanding Modern Web Service Deployment in EC2 and Azure
Keqiang He, Aditya Akella

A global, in-depth understanding of large-scale cloud use. Questions:
1. Who is using public clouds? (cloud-using domains, traffic profile)
2. How are web services using the cloud (impact of failures etc.)?

Datasets:  University packet capture and Alexa subdomain DNS records

Domains: 4% of domains are using the cloud; among those, 90% are using Amazon's cloud.
Traffic volume: 81.7% goes to Amazon, and 18.3% to Windows Azure
Front end: 8% PaaS front end, 4% load-balancer front end, 72% VM front end
Regions: 97% of the domains are using just one region ---> single region failure can take down a large fraction of cloud-using services
Availability Zone: 33% domains are using 1 availability zone, 40% are using two.

Highlights:
1. 4% of popular services use EC2/Azure
2. VMs are the most popular EC2 front end; value-added features are used by top domains
3. Limited zone and region usage.

Question:
Q: Dropbox dominates so much. If Dropbox were taken out of the picture, would the results still be the same?
A: For flow analysis, I don't know. For domain analysis, it would be the same.



Viewbox: Integrating Local File Systems with Cloud Storage Services
Yupu Zhang, Remzi Arpaci-Dusseau

Is your data really safe? Not really, in Dropbox, even if you have many copies...
Corruption propagates to the other copies.
Crash inconsistency ends up everywhere.
"Out-of-sync" synchronization: client crashes are not handled correctly.
Causal inconsistency: files are uploaded out of order, so the cloud state won't match any FS epoch.

Why:
There is a semantic gap between file systems and cloud sync services; they are designed separately.

Viewbox: view-based synchronization
Views: in-memory snapshots of the synchronization folder, created at FS epochs (e.g., at journal commit time).
One active view (to absorb FS changes), one frozen view (to sync to the client).

Components of Viewbox:
1. ext4-checksum ---> corruption detection and inconsistency detection (e.g., a transaction didn't commit)
2. ViewFS (an in-memory extension to ext4-checksum; it keeps an operation log of all the metadata, so it can later apply it to the frozen view?)
3. Cloud-aided recovery.

Stratos: A Network-Aware Orchestration Layer for Virtual Middleboxes in the Cloud
Anand, Aditya

(This is the middlebox scaling Aditya was talking about last year)
State of the art: in order to steer traffic to middleboxes, tenants need to overlay a network on top of the cloud network.

Stratos:
Forwarding controller: translates policies over middlebox chains into forwarding rules.
    1. Split chains into subchains at mangling/connection-terminating middleboxes
    2. Calculate forwarding rules.
Data plane
Resource controller:
     Monitors the end-to-end performance of chains, then decides the resources on each chain. (Uses flow distribution, horizontal scaling, instance migration, and scaling up middleboxes, from lighter-weight to heavier-weight.)






Wednesday, November 13, 2013

Verification games: Making software verification fun

Michael Ernst, U Washington

Software engineering:
   mathematics (modeling, analysis), but there is a non-ideal component: people!

Crowd-sourced software verification.

Programming is like a game. But when is it not fun?

Code ---> automatically translated ---> Game ---> people playing ---> completed game ---> automatically translated ---> verified software (with proof/annotations)

Idea:
Encodes a constraint system for both code and game.


Example:
Code: assignment can only go from a type to its supertype. (variable: ball, assignment: pipe)
Code: null pointer errors (small ball: non-null value, big ball: null value, pipe: dereference)
Game: ball can only fall into pipes wider than the ball
Cheats: cut the ball size to win unwinnable game (bug detected!)
Most accurate intuition: type constraints
Different pipes of the same color on different boards: different uses of the same variable
Different boards: different methods of a class
Level: a class
You need to solve all boards of one level at once. (Non-local reasoning)

Type flow vs. data flow:
Good for type flow (no loops, specific values, etc.), not good for data flow

Other examples:
SQL injection
race conditions and deadlocks (if you use lock)
unintended side effects

Why type systems for verification are a good fit for games:
Modular, local reasoning, and no need to understand the whole program

3-way collaboration: machines, players, verification experts
Machine optimization: simplify the challenge to its essence (abstract interpretation, constraint propagation, heuristic solving) --- to make the game high level and playable. Too much detail: players get distracted; too little detail: unable to produce a useful result.
Players: utilizing a community (20,000 years of human time were spent playing Angry Birds up to 2011!); this is a game-design point. The same problem can have different skins --- different games.

Game requirements: abstraction (easy playing), modularity (different people work on different parts of the program)

Scoring System: Guide people to do more effective verification. In some sense, designing a good scoring system is the whole problem.


Why not convert everything into SAT and gamify SAT:
The size explodes and the problem structure is destroyed, so there is no human intuition to exploit

Big questions:
How do we gamify software engineering, or even computer tasks in general?


SOSP'13 papers

Some useful links:
SOSP program page: http://sigops.org/sosp/sosp13/program.html (paper, slides, online questions and talks)
syslog: live from SOSP 2013: http://www.syslog.cl.cam.ac.uk/2013/11/04/live-blog-from-sosp-2013/


The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors (best paper)
Robert Morris, MIT CSAIL

Motivation:
Instead of relying on running workloads to find scalability bottlenecks, we should look at interface design and its implementation.

Proved by a simulation construction:
Whenever interface operations commute, they can be implemented in a way that scales.
      e.g., creat() returning the lowest free fd doesn't commute, but creat() returning any free fd does commute (see the sketch below)
      Intuition: if operations commute, the results are independent of order, so you can implement them in a communication-free way, which scales well.
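A toy illustration of that fd example (my own sketch, not the paper's Commuter tool): run two callers in both orders and check whether the per-caller results depend on the order.

```python
# Toy commutativity check inspired by the creat() example.
from itertools import permutations

def alloc_lowest(used, caller):
    return min(fd for fd in range(1024) if fd not in used)   # POSIX-like: lowest free fd

def alloc_any(used, caller):
    return 100 + caller                                       # any unused fd, e.g. caller-specific

def commutes(alloc, callers=(0, 1)):
    outcomes = set()
    for order in permutations(callers):
        used, result = set(), {}
        for c in order:
            fd = alloc(used, c)
            used.add(fd)
            result[c] = fd
        outcomes.add(tuple(sorted(result.items())))
    return len(outcomes) == 1       # same per-caller results under every order

print(commutes(alloc_lowest))   # False: whoever calls first decides who gets fd 0
print(commutes(alloc_any))      # True: results are order-independent, so it can scale
```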

Commutativity is sensitive to operations, arguments, and state. (Thus it is more or less a dynamic property.)

Commuter:
Takes a specification (e.g., POSIX, in the form of a symbolic model) and an implementation (e.g., Linux) as input, and outputs scalability bottlenecks. It finds bottlenecks by first checking the interface and analyzing all commutativity conditions (e.g., for rename, two calls commute if both source files exist and the names differ, or neither exists, or both are self-renames, etc.), then exhaustively generating tests for all pairs of cases, checking whether they are conflict-free (i.e., do not access the same state), i.e., commute, and outputting these as test cases. Finally, it runs the test cases on an implementation (say, Linux) and checks whether, for the commutative operations, Linux touches any shared state that makes them not commutative in practice.

Results:
They ran Commuter on Linux and found 68% of the cases to be non-scalable (conflicting). They then built sv6, removing most of the bottlenecks.

My comments:
1. How useful this is actually depends on the richness of the interface you provide. And here "interface" is used as a broad term: it also includes the underlying state it might touch (say, inodes for renames), which you could argue is part of the implementation... The trick is to give it enough detail that it doesn't report everything as scalable, but not so much detail that you are handing it specific implementation details as interfaces.
2. The title says "interface", yet they didn't seem to pinpoint a place where the POSIX interface is not scalable. (Well, maybe some small places, like creat() returning the lowest fd instead of any fd.) In general, POSIX is pretty good in terms of scalability.
3. The graphs they provide (figure 6 in the paper) are quite interesting. They report conflicts, but don't say how frequently those conflicts happen in the real world (say, I might have a global lock on a rare operation). But I guess you could argue that Commuter just provides information; it is up to you to decide how important the conflicts it finds are.
In general, this is a pretty novel paper and it certainly changed the way I think about scalability.


Toward Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior
Armando Solar-Lezama, MIT CSAIL

This is actually a PL-related paper done by pure systems people.

High level idea:
Undefined behavior in language causes problem because programmer and compiler operate under different assumptions.

Approach:
Since compilers usually optimize away unstable code (code which invokes undefined behavior), run the compiler (the STACK tool) twice: once assuming undefined behavior may be invoked, and once assuming undefined behavior is never invoked. Compare the delta to find where the compiler discards code, and thus find unstable code.

Limitation:
I think undefined behavior doesn't manifest itself only as the compiler optimizing away code. Sometimes it is just a hidden assumption made when implementing the compiler. For example, the compiler chooses a particular order in which to evaluate function arguments. That couldn't be found by this framework.

My comments:
1. Invoking undefined behavior is a manifestation of a gap in semantics/agreement between the programmer and the compiler. This kind of gap happens not only in language compilation, but also in file systems (e.g., Thanumalayan Sankaranarayana Pillai's recent work), in CPUs, and in all layers of systems where we have a stable API and people expect the lower layer to just work. This paper provides a framework to identify and analyze it in the compiler; can we extend it to work for all layers? What can we borrow from their insights?
2. How hard is it to implement a compiler which doesn't exploit undefined behavior? Maybe I shall try it next semester in 706...




Friday, November 1, 2013

RETCON: Transactional Repair without Replay

This is a paper from the architecture community (ISCA'10), and it solves a transactional memory problem. But maybe it is useful in a journaling file system's context?

Inspired by selective replay [Srinivasan+'04, Sarangi+'05, ...]

Observation: the main data can be updated concurrently, but the auxiliary data conflicts. Auxiliary data serializes parallel operations and thus significantly reduces performance. (Example: a hashtable's size counter, etc.)

Basic idea: use symbolic tracking to ensure that if the auxiliary data conflicts, we repair it, while the main data can still be updated concurrently.


Approach:
1. Use symbolic values for inputs/outputs (keep track of which variables we are storing in each register, and which variables we are writing from registers to memory).
2. At commit: if somebody else changed the aux data's value during our transaction, we just repair it by updating the aux data again, while the main data structure remains concurrently updated.
3. Track constraints on inputs: if no constraint is violated, then no alternative code path needs to be taken (no change in control flow), and 1 and 2 work just fine; but if the control flow changed, we need to abort the transaction and restart the execution.

How this could be applied to file system?

Related work:
•ReSlice [Sarangi+’05]
        Maintains insns in dependent slice of conflicting operation in TLS
        To repair, re-executes these instructions
•Dependence-aware transactional memory [Ramadan+’08]
        Forwards speculative values to optimize ordered communication
        Unlike RetCon, can’t handle conflicts with cyclic communication…
        But OTOH, can handle arbitrarily complex computation
•Advanced TM interfaces & data structure implementations
        Open nesting [Moss+’05,Moravan+’06,Ni+’07]
        Transactional boosting [Herlihy+’08]
         Abstract nesting [Harris+’07]
         Lock-free hashtables [Click’07,…]
         Scalable non-zero indicators [Ellen+’07]

External systems research at Google

Bradley Chen
Principal Engineer, Google

He had an amazing paper at SOSP'93, which described the impact of different OS structures on memory system performance, and really showed how to do interesting, detailed measurements on systems.

Google refuses to put researchers in an lab, but on the field.

Ph.D teaches you a systematic way to approach creating new science, and to advance the state of the art.

Google's support for university research:
1. Faculty research program (for faculty)
2. Research fellow program (for Ph.Ds)
3. Own publications
4. Open source projects
5. Conference sponsorship

Some examples:
1. Fighting fraud on the internet (Stefan Savage, UCSD)
    Computer security is not just a technical problem; it is really about people (adversaries, victims, defenders). Thus socio-economic factors come in.
     E-mail spam: a complex value chain of relationships -- email programs, DNS, web servers, goods/service shipping, financial transactions, etc. They measured all the parts and found that most resources are cheap and profitable, but the merchant banks involved are very few (actually, 3). Result: targeted payment intervention. (Really cool!!!) Remzi said: don't they worry about the Russian guy showing up in their labs?!
     Q by Remzi: maybe it's a bigger problem if drugs are cheaper overseas? Then you cannot really stop the brokers. 
      I need to read more about their methodology for doing these kinds of measurements. They do collaborate with banks to track the money flow, though.

2. Integrating Circuit Switching into Datacenter (UCSD)
    The network doesn't scale very well to datacenter scale, because with larger bandwidth a switch comes with fewer ports, so you need a deeper stack of switches to connect the same cluster.
     Key idea: hybrid circuit/packet networks. (Actually it's just an optical/packet switched network.)
     Mordia: circuit switching to the ToR --- a fast control plane (~100ms down to ~100us), and a fast OCS.
     Mordia network model: a traffic matrix over time. Advancing over the original hotspot scheduling, they make reconfiguration much faster by (1) faster hardware and (2) a faster way to observe the network using the matrix. A TM represents traffic needs ---> a TM' represents the bandwidth allocation, then TM' is decomposed into a schedule (I didn't really understand how though....)
   Faster hardware: going from a 3D mirror setup to a 2D mirror setup; reconfiguration time drops to ~100 us.
   Remzi: Google stopped saying that the network is their bottleneck in their papers (unlike the earlier papers such as GFS); doesn't that mean Google has new ways to do networking?

3. Disciplined Approximate Computing.
    When you convert analog to digital, it is just an approximation. But you need to be aware when approximation happens, instead of doing it recklessly.
    "Disciplined" approximate programming: decide when approximation can be used. They have language and architecture support, which allows you to specify when approximation can happen. A digital NPU: compute like a neuron.
     Results: 2x speedup, 3x energy reduction, with little quality loss.

Q: When Google funds a project, is it based purely on merit, or also on the other funding sources which might be available to the project?
A: Virtually all the projects Google likes have some other source of funding, like NSF.
Q: Does Google lobby for research a lot?
A: Google is doing more and more of that, for better or for worse. Actually the UK and Switzerland are a lot more robust in terms of supporting research, while Russia and France are much worse than the US.


Friday, October 11, 2013

RAMCloud and the Low-Latency Datacenter
John Ousterhout, Stanford

Low-latency Datacenter in General:
     Datacenter: large compute power + large amount of storage + fast communication, homogeneous, centrally controlled
      How many datacenters: $1K-10K per person for IT infrastructure?
      Innovations in Datacenters:
               Phase 1: manage scale (map-reduce)
                Phase 2: low latency (from 1-2ms to 2-10 us)
                                 In current datacenter, application is separated from its data (unlike in a single machine)

RAMCloud (Storage system for low latency datacenter)
Goal:
1.       All data in DRAM (no cache misses)
2.       Durable and available
3.       Large scale
4.       Low latency (5-10us remote access) for small chunks of data

Architecture:
                App + network + storage servers + coordinator
                Storage servers: master + backup. Master manages memory and data requests, backup manages secondary storage and durability, talking to other masters for availability etc.
               
Operations:
                Read, write, delete, cwrite(conditional write based on version number)

Data Durability:
                Eventual durability, log-structured approach (with a small amount of non-volatile memory).
                Crash recovery: log replaying in large scale (?)
Log-structured memory:
            Don't use malloc (it wastes at least 50% of memory when the workload changes); instead, structure memory as a log (80%-90% memory utilization)
New Projects:
                Higher level data model  (for database, say, secondary indexes?)
                System architecture for RPC
               
Threats to Latency:
                Layering (how to organize software structure but still low latency?)

                Network buffering (esp. TCP just tries to find empty buffers!)

Monday, October 7, 2013

Differentiated Storage Services

Intel Labs
SOSP 2011

Key Idea:
1. Classify I/O requests. Classifiers preserved across different layers: applications, file systems, and block storage systems.
2. Division of labor for I/O performance optimization:
     File systems or Applications: develop classification schemes based on its own semantics (how to classify, policy for each class)
     Operating Systems: Preserve classifiers when passing/translating I/O requests.
     Block Storage System: Implementing the polices using its internal mechanisms.

Example:
    The file system classifies based on block types and assigns a priority to each block class
    The storage system implements the priority policy by giving cache preference to higher-priority blocks (thus no priority inversion)
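A small sketch of what such a policy could look like inside the storage cache (my own illustration, not the paper's implementation): eviction always picks a victim from the lowest-priority class, so high-priority (e.g., metadata) blocks are never pushed out by bulk data.

```python
# Illustrative priority-aware cache: each request carries a class label set by
# the file system (e.g. metadata > journal > data); eviction picks a victim
# from the lowest-priority class present.
class PriorityCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}              # block -> (priority, data)

    def access(self, block, priority, data):
        if block in self.entries:
            return self.entries[block][1]           # cache hit
        if len(self.entries) >= self.capacity:
            # Victim: lowest-priority resident block (a real system would use
            # LRU within each class to break ties).
            victim = min(self.entries, key=lambda b: self.entries[b][0])
            if self.entries[victim][0] > priority:
                return data                          # don't cache low-priority block
            del self.entries[victim]
        self.entries[block] = (priority, data)
        return data

cache = PriorityCache(capacity=2)
cache.access("inode-7", priority=2, data="...")      # metadata, high priority
cache.access("data-1", priority=0, data="...")
cache.access("data-2", priority=0, data="...")       # evicts data-1, never inode-7
print(list(cache.entries))                           # ['inode-7', 'data-2']
```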

Comparison of DSS and IOFlow:
     1. IOFlow gives differential treatment to VM-level I/O flows, while DSS gives differential treatment to each I/O request. (Different granularity.)

Reliability/corruption model and some key-value store building stuff


Zettabyte Reliability with Flexible End-to-end Data Integrity

Data corruption can go undetected, so high-level (end-to-end) integrity is needed
Checksum used (strong checksum needed)
Drawbacks:
         Performance is bad (need to compute the checksum) -- addressed by changing the checksum online
         Detection is too late (we want to detect before the data goes to durable storage!) --- solved by letting every component know about the checksum

         They use a corruption-and-checksum model to derive the probability of undetected corruption (for a b-bit block)
        
Zettabyte reliability?????
         Less than 3.46 * 10^-18 for a 4KB block (a score of 17.5)

Improving disk reliability doesn't improve overall reliability that much (maybe because the disk corruption probability is already small???)


How about adding compute overhead as a parameter?
        
PUE: (facility power vs. power delivered to the datacenter's computing equipment; lost in transformers, etc.)
2005: 2-3; 2012: ~1.1

Proportionality: (when idle, use no power; when computing at full load, use full power)

Energy efficiency: 1 GHz is the sweet spot

Increasing speed costs you twice:
1.       Once for switching speed
2.       Once for the memory wall (caching, prefetching, out-of-order execution)

So, a sweet-spot configuration (wimpy nodes):
1.6 GHz dual core
32-160 GB flash SSD
Only 1 GB RAM

Design the key-value store from the very bottom (hardware) up
Fast front-end cache (cuckoo hashing?)
Backend: a log-structured data structure + hash-table index (instead of using a file system)
Partial-key caching (the complete key is stored along with the data to deal with collisions) to enable memory efficiency
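A rough sketch of the partial-key idea (my own illustration; the actual design described in the talk is more involved): the in-memory index keeps only a small hash tag per key, and the full key stored on flash next to the value resolves collisions.

```python
import hashlib

FLASH = []            # append-only "flash" log of (full_key, value) records
INDEX = {}            # 16-bit key tag -> list of flash offsets (collisions possible)

def tag(key: str) -> int:
    # Keep only 2 bytes of the key hash in memory to save space per entry.
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:2], "big")

def put(key, value):
    INDEX.setdefault(tag(key), []).append(len(FLASH))
    FLASH.append((key, value))

def get(key):
    for off in INDEX.get(tag(key), []):
        full_key, value = FLASH[off]      # one flash read per candidate
        if full_key == key:               # the full key on flash resolves collisions
            return value
    return None

put("user:42", "alice")
put("user:43", "bob")
print(get("user:42"))    # 'alice'
```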

Then the hardware changed (CPU 6x, memory 8x, SSD 30-60x) --- so CPU and memory have to keep up
How do you minimize memory per entry???
A static external dictionary problem in theory
EPH -- 3.8 bits/entry
Entropy-coded tries: 2.5 bits/entry

Think of it as a pipeline!! (how??????)
They shadow writes and then batch-copy them into the main index(?)
So they can only support up to 10% puts

Problem: the Linux kernel I/O stack overhead is too high
Load balancing: add a cache at the front end to deal with hot spots
Proof: only n log n cache entries are needed for n nodes to achieve almost perfect load balancing
So you only need L3 for caching!
Intuition: (didn't understand)
        
Their solution can't deal with hash attacks (it assumes the hash function is not visible)


Some tradeoffs: more reads to avoid some writes?
Is the bottleneck always in I/O? Also, flash performance is bursty.

They want to manage raw flash, and treat it as many sequentially-written devices(???????)
Now they have an SCSI command to exchange (remap) mappings on the SSD, and they can do cool things using that





Korean Android Trace/Measurement

They built an FTL simulation tool, plus a back-mapping from block number to inode number to file path.

Using that, we were able to measure:

1. Write size (sectors) and write requests distributions on file types and application.
    For almost every application:
    -- ext4 journal writes are the most
    -- then Android and database temporary files, I think

2. Avg overwrites on a block distribution on file types and application
   for each application:
     -- many overwrites on the same block for ext4 journal files and database/database journal files

3. Avg write size per request based on file types and application
    for each application:
     -- 1.x blocks per write request for almost all kinds of writes
     -- 6-7 blocks per write for ext4 journal

4. Traffic source --- how much data is issued for write for each type of file (per application)
     -- for every 1 KB of user data, it seems like 3 KB of Android & kernel temporary data is written, plus 2 KB of FTL garbage collection and other FTL writes

The science of crash consistency

barriers are not used
but we usually get away with it because the journal is written sequentially

consistency is not a binary thing -- probabilistic crash consistency
What's the probability of inconsistency upon a crash?

P = time in (inconsistency) window / total time
(They assume/observe that for a given storage configuration, not many distinct orderings are observed, so they average these out.)

Different re-ordering:
commit-commit: second transaction before the first transaction
commit-data: data block before transaction commits
commit-metadata: metadata block before transaction commits
checkpoint-commit
checkpoint-data
checkpoint-metadata



IO Priority: Related Work


Labeling I/O requests globally:


Differentiated storage services 
SOSP 2011 M. Mesnier, F. Chen, T. Luo, and J. B. Akers.
Applications label IO requests. The label then propagates with the requests from the file system to a block device using the SCSI protocol.

Diagnosing performance changes by comparing request flows. NSDI 2011
Diagnosing performance changes by comparing request flows SIGMETRICS 2006
Explicit labels for IO requests in distributed system.


QoS for Storage Systems:

A Fresh Approach to File System Quality of Service. NOSSDAV 1997
A Practical Learning Approach for Dynamic Storage Bandwidth Allocation. IWQoS 2003
Providing QoS guarantees for disk I/O. Multimedia Systems, 57-68, Feb 2000
Traveling to Rome: QoS specification for automated storage system management. IWQoS 2001


Multi-tenant storage control:


Proportional allocations of resources for distributed storage accesses. FAST 2009
Proportional share scheduling for distributed storage systems. FAST 2007

Performance Insulation for Shared Storage Servers. FAST 2007

Performance Differentiation for storage systems using adaptive control. ACM Transactions on Storage

Performance Isolation and Fairness for multi-tenant cloud storage
David Shue, Michael J. Freedman, and Anees Shaikh.
OSDI, 2012
   
System level max-min fairness at the key-value store level.
Assume well-provisioned network, do not deal with in-network resource sharing.  
They mainly deal with: where to place datasets, where to replicate datasets, how to allocate weights to each node, and how to do fair queuing locally to ensure DRF even though some datasets are (dynamically) more popular than others. (They use DRF, so they do deal with multiple resources, namely bytes received, bytes sent, and requests; bytes are bound by network/storage, requests are bound by interrupts???)

IOFlow: A Software-Defined Storage Architecure

from Microsoft Research and Microsoft
SOSP 2013

This is the closest architecture to what I imagine Software-Defined Storage should be like.
They borrowed the whole set of ideas from SDN, especially the separation and programmability of the control plane.

Key Idea:
1. Define IOFlows as {VMs, operations, files, shares}, basically source, destination and type (read/write/create)
2. A language to describe end-to-end policies over different I/O flows (bandwidth, priority, middleboxes etc.)
3. Define stages as different I/O request processing points (storage drivers, NICs, etc.), and expose a standard queuing abstraction. (analogy to switches and openflow in SDN)
4.  A centralized controller realizes the policies by specifying queuing rules at the different stages (analogous to the SDN controller; the distributed state abstraction is also realized here). The rules are updated periodically based on running statistics. (See the sketch below.)
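A toy rendering of points 3 and 4 (my own sketch, not IOFlow's actual interface; the flow names and rates are made up): the controller installs per-flow token rates at each stage, and a stage only forwards a request when the matching flow still has tokens in the current control interval.

```python
# Hypothetical sketch of an IOFlow-like stage with controller-installed rules.
class Stage:
    def __init__(self, name):
        self.name = name
        self.rules = {}      # (vm, operation, share) -> tokens per interval
        self.tokens = {}

    def install_rule(self, flow, rate):      # called by the central controller
        self.rules[flow] = rate

    def refill(self):                         # called once per control interval
        self.tokens = dict(self.rules)

    def admit(self, vm, operation, share):
        flow = (vm, operation, share)
        if self.tokens.get(flow, 0) > 0:
            self.tokens[flow] -= 1
            return True                       # forward the I/O request
        return False                          # queue it until the next refill

hypervisor_stage = Stage("hypervisor-driver")
hypervisor_stage.install_rule(("VM1", "read", "//share/db"), rate=2)
hypervisor_stage.refill()
print([hypervisor_stage.admit("VM1", "read", "//share/db") for _ in range(3)])
# [True, True, False] -- the third request waits for the next interval
```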

A lot of challenges can then be solved by directly using SDN-based techniques.

Challenges Specific for Storage:
1.  No standard I/O request header (like network packet header)
     Solution: controller maps flow names to low level identifiers (files, blocks, etc.)
2.  I/O requests, unlike network packets, have different types (read/write/create), which requires different treatment
     Solution: releasing tokens based on end-to-end cost of operations. This cost is estimated by running benchmark to get the cost of I/O request as a function of their type and size.

Unsolved Problems:
1. In order to realize some policies, you need to enforce rules at EVERY I/O processing point (e.g., HOL blocking for priority).
    For example, file systems (journals), I/O schedulers, network switches, distributed file systems, etc. How do you enforce policies at these places, and how do they work together? They didn't address this.
    (The unsupported-layer problem could be alleviated by controlling the number of outstanding I/Os in those layers, though.)
2. Didn't consider the impact of network.
    They assume a good enough network (40G/s), didn't consider the possibly complicated storage/network interactions.
3. Does not consider the location information of I/O requests.
4. Needs to operate at VM granularity; no finer-grained control.
5. This kind of queuing model doesn't work well with caching mechanisms.

Thursday, June 27, 2013

Some Cache Oblivious Database

1. TokuDB:
    paper here: "Cache-Oblivious Streaming B-trees"
    http://www.cs.sunysb.edu/~bender/newpub/BenderFaFi07.pdf
    Some description in the bottom.
 
   TokuDB is good for bulk insertion followed by bulk look-ups. It is not so good for a mix of insertions and look-ups. This could potentially be alleviated by adding a Bloom filter on top of their B-tree for searches, but they didn't implement that.


2. Acunu's Castle project
    http://www.acunu.com/2/post/2011/07/castle-behind-the-scenes-with-acunus-new-storage-engine.html
    They have a Bloom filter built in.
    They have the whole thing built in the kernel, and have a ring buffer for user-space applications to put/get key/values.
    They also batch system calls into the kernel to reduce overhead.
    They seem to use "doubling arrays" as the key data structure, but it looks very similar to TokuDB's fractal trees. If so, they might have the same downsides too: good for a large number of updates/writes followed by a large number of lookups/reads; bad for a mix of updates and reads.






Principles of Database Indexing (1) -- the COLA-Tree / Cache-Oblivious Lookahead Array (Fractal Tree) in TokuDB

       Today, both commercial SQL Server and open-source MySQL still basically use the rather old B+Tree index structure (SQL Server uses a standard B-Tree). In principle the B-tree family should not be slow for queries; the main problem is insertion. When inserting into a B-Tree, if the insert goes into the last node it is very fast, because it is a sequential write. But when the inserted data is fairly random -- say you insert 5, then 10000, then 3, then 800, values spread far apart -- you must first find the position where the data should go and only then insert it. If those positions are very scattered, the leaf nodes are usually not in memory, and every lookup pays a disk seek. Updates are basically the same as inserts. For a B-Tree that has been running for a long time, nearly every request becomes random I/O. That is the biggest problem of B-Trees on disk.
       Random I/O is the problem that makes almost every DBA's face go pale. When the data set is small, everything fits in memory and the problem disappears (and at that point there is no real need for the B-Tree's block structure anyway), but as soon as the data outgrows memory the problem appears abruptly. In essence, the problem a key-value store has to solve is exactly this: write as fast as possible, and read as fast as possible.
        Before analyzing solutions, consider the extremes. At one extreme, if every write is purely sequential, inserts are as fast as they can be, but queries become unusable, because every query has to scan the whole table. If instead we want the best read performance, the answer is to keep everything sorted, as a B-Tree does. Given that B-Trees suffer from the random I/O described above, is there a way to get the write performance of sequential writes while keeping the read performance of a B-Tree? There are currently two main approaches: one is the LSM-Tree (which I will introduce in a later post; if you really cannot wait, see the blog post by July), and the other is the COLA Tree. What follows is the COLA-Tree used in TokuDB.

1. Principle 
        Assume a collection structured as levels, where adjacent levels double in size. Each level is either completely full or completely empty, and the data in a full level is kept sorted.
        On insertion, taking the storage state in the (omitted) figure as an example: a newly written value goes into the first level. Say we write 3; the first level is empty, so 3 is placed there. When we then write 11, the first level is already full, so 3 is taken out, sorted together with 11, and we try to write the pair into the second level. The second level is also full, so its contents 5 and 10 are taken out as well; 3, 11, 5, 10 are sorted and written into the third level. The final result is as in the (omitted) figure.
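A minimal sketch of this insert-and-merge behavior (my own illustration of the basic, amortized COLA; it leaves out the auxiliary arrays and the fractional cascading discussed below):

```python
class COLA:
    """Basic cache-oblivious lookahead array: level k holds 2**k sorted items,
    and each level is either completely full or completely empty."""
    def __init__(self):
        self.levels = []          # levels[k] is a sorted list of size 2**k, or []

    def insert(self, x):
        carry = [x]
        k = 0
        # Merge downward while the target level is full, like binary addition.
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = sorted(carry)
                return
            carry = sorted(carry + self.levels[k])   # sequential merge of two runs
            self.levels[k] = []
            k += 1

    def search(self, x):
        # A real COLA does a binary search per level (plus forward pointers).
        return any(x in level for level in self.levels)

c = COLA()
for v in [3, 11, 5, 10, 7]:
    c.insert(v)
print(c.levels)        # [[7], [], [3, 5, 10, 11]]
print(c.search(10))    # True
```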

2. Disk I/O cost
        Let us analyze the amount of disk I/O. Merging two levels containing X elements in total costs O(X/B) I/Os (where B is the size of one disk block), which amortizes to O(1/B) per element. If there are N elements in total, an element can be merged at most log N times, so the amortized cost of one insert is O(log N / B). In comparison, with a B-Tree index the tree height is log_B(N), so an access costs log_B(N) = log N / log B disk I/Os; for queries the number of disk accesses is the same in both structures.
        That is TokuDB's analysis, and frankly I find it a bit forced: it exploits the fact that a B-Tree can only insert items one at a time, and has no particular advantage for bulk insertion of many items. If the cache-oblivious tree also inserts items one at a time, isn't the worst case of a single insert O(N/B) I/Os?
        Going back to the original paper: it admits that the worst case of a single insert is O(N/B) I/Os, but proposes a way to improve that worst case. The main idea is to pair each level with an auxiliary array of the same size (the two-arrays-per-level structure in the omitted figure). Inserted data goes into the original array first; if the original array is full it goes into the auxiliary array (if the auxiliary array is also full, the algorithm design has failed). Inserts are performed by one thread: the inserted item is placed into the level-0 space (into the original array if it is empty, otherwise into the auxiliary array; if that too is full, the design has failed). Another thread performs merges, scanning levels from low to high as in the figure: each completely full pair of arrays is merged, sorted, and written into the next level (into that level's original array if it is empty, otherwise its auxiliary array; again, if that is full the design has failed). A merge pass ends under one of two conditions: either every level pair that needs merging has been merged, or the pass has moved m elements. What should m be? It must satisfy two requirements: first, every merge pass must guarantee that two adjacent levels are never simultaneously full; second, m should be as small as possible (clearly the worst-case insert cost is O(m)). The authors prove the following: for a database with k levels, moving 2k+2 elements from low levels to high levels in each merge pass is enough to guarantee that two adjacent levels are never simultaneously full (I honestly did not fully understand the proof -- it is Lemma 21 on page 9 of the paper; I would appreciate an explanation from anyone who does). So for a database with N elements there are log(N) levels, each merge pass moves 2*log(N)+2 elements, i.e. (2*log(N)+2)/B I/Os, i.e. O(log(N)/B). The worst-case I/O thus magically drops from O(N/B) to O(log(N)/B).

        Now consider query performance. To speed up lookups, TokuDB attaches a forward pointer to each element, pointing to the position of the first element larger than it in the next level (this is called Fractional Cascading). On average, each element in one level narrows the search range in the next level down to a constant number of candidates, so the number of disk I/Os per lookup is at worst O(log N), which seems fine.
        The method TokuDB actually uses improves further on the Fractal Tree described above, taking the insert I/O cost from log(N) to log_B(N); but since TokuDB is not open source, we cannot know how they implemented it.

3. Performance analysis
        To summarize TokuDB's performance: in various people's tests, random insertion gets 20-80x faster, while purely sequential insertion becomes about 3.5x slower.
        If you need to store blobs, do not use TokuDB, because it limits how large a record can be;
        If you care about update performance, do not use TokuDB; it is not as fast as InnoDB;
        If you want to shrink the storage space your data occupies, TokuDB is a good choice, since it also compresses its files.

Further reading: 
"Cache-Oblivious Streaming B-trees"