2013年12月13日星期五

Mahesh Balakrishnan's recent work on SSD-cache

This guy works in Microsoft Research, and has done a lot of work which use SSD as a cache...

Gecko: Contention-Oblivious Disk Arrays for Cloud Storage
FAST'13

Motivation:
Multiple VMs issuing I/O workload to underlying storage, which mixes up sequential and random accesses, or different sequential streams, causing disk to see only random accesses, thus degrade I/O performance severely.

Inadequate solutions:
Re-ordering I/O: increase latency, in-complete
Workload placement: requires prediction, maybe inaccurate, not flexible
LFS: garbage collection kills performance

Observation:
A single sequentially accessed disk is better than multiple randomly seeking disks.

Solution:
Instead of striping, they chain multiple disks into a cycling log. Log-structure style write always happen in the tail; do garbage collection in the head, thus separate random and sequential accesses streams. Reads could still happen all over the disks though, they have an SSD as cache to alleviate read problems and speed up writes. Metadata, such as reverse index etc., are also put in SSD.

Performance:
They have consistent performance, ~100 MB/s. Traditional stripping usually get 200 MB/s, but during garbage collection (about 30% -50% of the time) would be below 50 MB/s

My comments:
Another way to solve this multiple I/O workload mingled together problem is to have VMs collaborate with each other. VMWare's mClock (OSDI'10) and Padra (FAST'09) has a little bit this flavor, but focused on fair sharing instead of sentimentality optimization. Might be an interesting problem to look at...



Tango: Distributed Data Structures over a Shared Log
Microsoft Research and Mircorsoft
Presented by Vijay

This is built on the same principle of log structure file system, but is distributed. It is built on SSD arrays thus random reads are fast (necessary for a log structure)

They keep a global sequence number to achieve serviceability. However, writes of different client and different sequence number will occur at different location of the log, thus could happen concurrently.
Transaction is implemented in an optimistic fashion:
A transaction starts with a transaction begin block in the shared log, followed by a read block which records the version number of all the objects this transaction is going to read (thus depend on), then followed by whatever updates this transaction made. Then every client could commit this transaction if the objects' version in the read block hasn't changed, without any coordination. If the objects in the read block has been updated, then the transaction just abort and restart again.

They claim that garbage collection is not a big problem because on flash cards you can do thousands of operations per second even with garbage collection going on.

This makes it easier to build distributed data service. They have built ZooKeeper in ~1000 lines of code on top of this. It is also likely to have higher performance because in a traditional distributed consistency protocol, communication overhead is high with many participating nodes, thus the good performance provided by SSD may  not be fully utilized.

2013年12月1日星期日

Bounded gap between primes

Here is a nice summary of Yitang Zhang's talk:

Now they have locked the number to 16. However, it is hard, if not impossible, to extend Zhang's method to prove twin prime conjecture, due to the parity problem of sieve theory.