Friday, December 13, 2013

Mahesh Balakrishnan's recent work on SSD-cache

This guy works at Microsoft Research and has done a lot of work that uses SSDs as a cache...

Gecko: Contention-Oblivious Disk Arrays for Cloud Storage
FAST'13

Motivation:
Multiple VMs issue I/O to the underlying storage, which mixes sequential and random accesses (or interleaves different sequential streams), so the disk sees only random accesses and I/O performance degrades severely.

Inadequate solutions:
Re-ordering I/O: increases latency, and is incomplete
Workload placement: requires prediction, which may be inaccurate; not flexible
LFS: garbage collection kills performance

Observation:
A single sequentially accessed disk is better than multiple randomly seeking disks.

Solution:
Instead of striping, they chain multiple disks into a circular log. Log-structured writes always happen at the tail; garbage collection happens at the head, which separates the random and sequential access streams. Reads can still hit any disk, though, so they use an SSD as a cache to alleviate the read problem and to speed up writes. Metadata, such as the reverse index, is also kept on the SSD.
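A minimal sketch of the chained-log idea as I understand it (my own illustration, not Gecko's code; the disk interface here is hypothetical):

```python
# Toy model of a chained circular log: writes only touch the tail disk,
# garbage collection only touches the head disk, so the sequential write
# stream never competes with GC reads on the same spindle.
class ChainedLog:
    def __init__(self, disks):
        self.disks = disks            # hypothetical disk objects, head -> tail
        self.head = 0                 # disk currently being cleaned
        self.tail = len(disks) - 1    # disk currently absorbing writes

    def append(self, block):
        """Log-structured write at the tail of the chain."""
        disk = self.disks[self.tail]
        if disk.is_full():
            self.tail = (self.tail + 1) % len(self.disks)  # wrap onto a cleaned disk
            disk = self.disks[self.tail]
        return disk.append(block)

    def gc_step(self):
        """Clean the head disk by relocating its live blocks to the tail."""
        head_disk = self.disks[self.head]
        for block in head_disk.live_blocks():
            self.append(block)
        head_disk.erase()
        self.head = (self.head + 1) % len(self.disks)
```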

Performance:
They get consistent performance, ~100 MB/s. Traditional striping usually gets ~200 MB/s, but during garbage collection (about 30%-50% of the time) it drops below 50 MB/s.

My comments:
Another way to solve this problem of multiple I/O workloads getting mingled together is to have the VMs collaborate with each other. VMware's mClock (OSDI'10) and PARDA (FAST'09) have a bit of this flavor, but they focus on fair sharing instead of sequentiality optimization. Might be an interesting problem to look at...



Tango: Distributed Data Structures over a Shared Log
Microsoft Research and Microsoft
Presented by Vijay

This is built on the same principle as a log-structured file system, but distributed. It is built on SSD arrays, so random reads are fast (necessary for a log structure).

They keep a global sequence number to achieve serializability. However, writes from different clients with different sequence numbers land at different locations in the log, so they can happen concurrently.
Transactions are implemented in an optimistic fashion:
A transaction starts with a transaction-begin block in the shared log, followed by a read block that records the version numbers of all the objects this transaction is going to read (and thus depends on), followed by whatever updates the transaction makes. Any client can then commit the transaction, without any coordination, if the versions of the objects in the read block haven't changed. If an object in the read block has been updated, the transaction simply aborts and restarts.
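A minimal sketch of that optimistic commit rule (my own illustration, not Tango's code; the record layout and names are made up):

```python
from dataclasses import dataclass, field

@dataclass
class TxRecord:
    """What a transaction appends to the shared log (layout is my invention)."""
    read_set: dict                               # object id -> version read
    writes: dict = field(default_factory=dict)   # object id -> new value

def decide_commit(tx: TxRecord, current_versions: dict) -> bool:
    """Any client can make this decision deterministically from the log:
    commit iff nothing in the read set changed since it was read."""
    return all(current_versions.get(oid) == ver for oid, ver in tx.read_set.items())

def apply_tx(tx: TxRecord, current_versions: dict, store: dict) -> bool:
    if not decide_commit(tx, current_versions):
        return False                 # abort; the client retries the transaction
    for oid, value in tx.writes.items():
        store[oid] = value
        current_versions[oid] = current_versions.get(oid, 0) + 1
    return True
```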

They claim that garbage collection is not a big problem because on flash cards you can do thousands of operations per second even with garbage collection going on.

This makes it easier to build distributed data services. They have built ZooKeeper in ~1000 lines of code on top of this. It is also likely to have higher performance, because in a traditional distributed consistency protocol the communication overhead is high with many participating nodes, so the good performance provided by SSDs may not be fully utilized.

Sunday, December 1, 2013

Bounded gap between primes

Here is a nice summary of Yitang Zhang's talk:

Now they have brought the bound down to 16. However, it is hard, if not impossible, to extend Zhang's method to prove the twin prime conjecture, due to the parity problem in sieve theory.

Wednesday, November 27, 2013

Some Linux tricks

1. Error message "Invalid output format udev. Choose from value, > device, list, or full >" when installing a kernel
    This is because the system uses an old version of blkid, which doesn't recognize the newer udev format.
    To solve this, just install a new util-linux package. Sometimes yum screws up and has the latest util-linux but blkid is still not updated; just reinstall in that case: yum reinstall util-linux

Monday, November 18, 2013

MIGRATORY COMPRESSION

EMC

Coarse-grained Data Reordering to Improve Compressibility

Compression is limited by its window size; a larger window size leads to slower compression.

MC: the first to do coarse-grained reorganization to group similar blocks, so that you can improve compressibility with the same window size

Gzip: Huffman coding, 64KB window
Bzip2: 900KB window, Burrows-Wheeler Transform and Huffman coding
7z: Markov chain-based range codec, 1GB window
(However, this is not an apples-to-apples comparison, as they don't know how to increase the window size of Gzip etc...)

There is an almost inverse linear relationship between throughput and compression factor.

Idea: improve compression by grouping similar blocks

Burrows-Wheeler: reorders within a block
MC: reorders the blocks themselves

Use cases:
1. Compress a single, large file
2. Archival tier of deduplicating storage system
3. network transfers (maybe?)

Delta compression: just store how to transform A into A' in order to store A'. OK for a single file; too complex for a deduplicating storage system, and not good for locality either.

Workflow of MC:
To compress:
segmentation (partition into blocks, calculate super-fingerprints) ---> similarity detector ---> re-organizer
To uncompress:
a header specifies how to reconstruct the file

Similarity detection: similar to "redundancy elimination at the block level", calculating features with some set of rolling hashes.
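A rough sketch of what rolling-hash similarity features could look like (my reconstruction of the general technique, not the paper's code; the window size, number of features, and super-fingerprint construction are assumptions):

```python
import hashlib

def block_features(block: bytes, window: int = 48, num_features: int = 4):
    """Each feature is the maximum hash value seen over a sliding window under a
    different mixing key, so blocks sharing long substrings tend to share features.
    (A real implementation would use a proper rolling hash, e.g. Rabin fingerprints.)"""
    features = []
    for k in range(num_features):
        best = 0
        for i in range(max(1, len(block) - window)):
            h = hash((k, block[i:i + window])) & 0xFFFFFFFF
            best = max(best, h)
        features.append(best)
    return features

def super_fingerprint(features) -> str:
    """Collapse the features into one super-fingerprint; blocks with equal
    super-fingerprints are treated as similar and reordered next to each other."""
    return hashlib.sha1(repr(features).encode()).hexdigest()
```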

Reorganization: reorganize the input file according to a recipe
1. in-memory: read in the entire file
2. chunk-level: random access
3. multi-pass: in general better than chunk-level

Evaluation:
At least for an in-memory system, adding MC before compression never does worse than vanilla compression. If things don't fit in memory, the gains are smaller because of the random accesses incurred. SSDs help though. --> Remzi: can you just limit the reordering scope so that it fits into memory?

Decompression has 30% to 70% overhead in an archival system though, due to random accesses.


Q: How about workload-specific compression? If you know the workload better, you may be able to compress better? MC is a pretty generic technique though?
A: I need to think about it.
     But NetApp's video compression falls into that category...



Friday, November 15, 2013

SyNS'13 Session 6: Security

Resource-freeing Attacks: Improve Your Cloud Performance (at Your Neighbor's Expense)
Venkat Varadarajan, Mike Swift, Tom Ristenpart

Side-channel stuff, taking advantage of CPU scheduling etc.
Solution: better isolation, or pack up the VM and move it (not all workloads are cheap to move).

This work: a greedy customer can get more resources through a resource-freeing attack

Victim:
Beneficiary:
Helper: mounts the attack; could sit inside the beneficiary

Example:
Cache contention: the victim webserver's frequent interrupts pollute the cache for the beneficiary (because Xen gives higher priority to VMs consuming less CPU time). The helper issues CPU-intensive requests to the webserver, which reduces the interrupts the webserver generates, so the beneficiary's performance improves.

General resource-freeing attacks:
1. Send targeted requests to the victim
2. Shift resource usage away from the bottleneck

OpenNF: All your network functions are belong to us (almost)
Aaron Gember, Aditya Akella

NFs (or middleboxes) examine/modify packets at layers 3-7, and deployment of software NFs is increasing.

But today NFs are treated as black boxes, which makes it hard to identify bottlenecks etc.

Example:
If you scale up, you need to move flows; otherwise the bottleneck may persist or the scale-up is delayed. But simply moving a flow (without its state) may lead to inconsistency, so move flows along with the internal state of the middleboxes.

Solution: OpenNF
APIs implemented by NFs so that applications can examine/update a middlebox's internal state.
Problem with VM replication: unneeded state may cause incorrect actions, and states cannot be merged, so complex load balancing is impossible.

NF state taxonomy:
Per-flow state, multi-flow state, all-flow state (statistics)

Per-flow state can simply be moved; multi-flow state is shared (but synchronization is expensive, so clone it and merge lazily only when scaling down); all-flow state is handled the same way as multi-flow state.

State consistency: (because the network could update state while we are moving things)
Suspending flow traffic and then moving is one solution.
Packet-redirect events are a more efficient solution (it is critical that the state is moved before the network routing is updated; see the sketch below).
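A rough sketch of that ordering (my own illustration; redirect_to_controller/export_state/import_state/update_route are placeholder controller primitives, not OpenNF's actual API):

```python
def move_flow(flow, src_nf, dst_nf, switch):
    # 1. Redirect/buffer the flow's packets so src_nf stops seeing them
    #    while its state is in flight.
    buffered = switch.redirect_to_controller(flow)

    # 2. Transfer the per-flow state first; dst_nf can only process the flow
    #    correctly once it has the state.
    state = src_nf.export_state(flow)
    dst_nf.import_state(flow, state)

    # 3. Only now update the routing, then release the buffered packets
    #    to the new instance.
    switch.update_route(flow, dst_nf)
    for pkt in buffered:
        dst_nf.process(pkt)
```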


RiskRoute: A Framework for Mitigating Network Outage Threats
Ramakrishnan Durairajan, Paul Barford

Adjust internet routing in real time, before outages (from natural disasters) happen.

Does internet routing currently take advantage of the predictability of natural disasters?
Yes, but today it is done by hand, and is therefore incomplete and inefficient.

Bit-Risk Miles metric:
Assesses sensitivity to network outages. It combines the number of bits sent, the distance they travel, and the outage risk along the path (essentially risk-weighted bit-miles).

RiskRoute methodology:
An optimization problem that minimizes bit-risk miles.
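A toy illustration of the idea (my own sketch; it assumes bit-risk miles is the traffic volume weighted by per-hop distance and outage risk, which is my reading of the metric's name):

```python
def bit_risk_miles(bits_sent, route):
    """Score a route as risk-weighted bit-miles.
    Each hop is a (distance_in_miles, outage_risk) pair."""
    return sum(bits_sent * distance * risk for distance, risk in route)

def choose_route(bits_sent, candidate_routes):
    return min(candidate_routes, key=lambda r: bit_risk_miles(bits_sent, r))

# A shorter route through a high-risk region can lose to a longer, safer detour.
risky = [(100, 0.20)]                # 100 miles through a hurricane-prone area
safer = [(150, 0.02), (80, 0.02)]    # longer path, much lower outage risk
best = choose_route(10**9, [risky, safer])   # picks `safer` here
```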

Evaluation:
1. Risk ratio: average reduction in bit-risk miles
2. Distance ratio: average increase in bit-miles.





SyNS'13 Session 5: Bugs

Toward Efficient, Portable Application-Level Crash Consistency
Thanu, Remzi

Many techniques exist for file system consistency (for metadata), but what about application data consistency (for user data)?

How do applications achieve consistency? They rely on specific details of file system implementations.

Application-level invariants: thumbnails match pictures, etc. But these are hard to maintain across system crashes.

Example:
Atomic file update: create a temp file, fsync() the temp file to disk, then rename(). The fsync() may be left out because of a wrong understanding, for performance's sake, or because on most file systems (e.g., ext4) it is actually correct without it.
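A minimal sketch of that update pattern on Linux (my illustration; the final directory fsync() is the conservative extra step whose necessity depends on the file system, as discussed above):

```python
import os

def atomic_update(path: str, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())        # persist the temp file's contents
    os.rename(tmp, path)            # atomically replace the old version
    # Conservative step: persist the rename itself by fsync'ing the parent
    # directory (whether this is needed is file-system dependent).
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```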

Goal: study application consistency behavior.
Methodology: case studies of SQLite and LevelDB

Properties offered by file systems (inspired by the disk properties SQLite relies on, I think):
1. Post-crash property (true/false): does a system call sequence result only in a given, desirable set of post-crash states? ext3-ordered: yes, ext3-writeback: yes, ext4-ordered: no, btrfs: yes
2. Ordered appends (true/false): ext3-ordered: yes, ext3-writeback: no, ext4-ordered: no, btrfs: no
3. Ordered dir-ops: directory operations get persisted in the order they were issued.
4. Safe appends: when a file is appended, the appended portion will never contain garbage
5. Safe new file: after fsync() on a new file, another fsync() on the parent directory is not necessary.

Bugs in SQLite and LevelDB: all of them are only exposed on certain file systems. And the developers said that they don't understand file system behavior, so they had to be conservative in their implementations, which hurts performance.

Performance:
Experiments show a 3x performance boost if you can rely on certain file system properties to optimize application behavior!

Future work:
Solutions that don't require rewriting applications; maybe system call reordering??? Tools to detect such bugs?


Efficient Concurrency-Bug Detection Across Inputs
Dongdong Deng, Shan Lu

Concurrency-bug detection is costly: it needs to take many inputs, and each input has a huge interleaving space to test. Software companies cannot afford exhaustive in-house testing.

Idea:
Remove redundancy across inputs. Because multiple inputs can trigger the same bug, how can we remove such overlaps?

Solution:
Find the overlapping interleaving space for different inputs, given a set of inputs.
Which two inputs will give the same interleaving space? We cannot answer this question perfectly, but we can estimate it. To this end, the concurrent function pairs (CFP) metric is proposed.

Characteristic study:
For each harmful data race, on average 7 inputs trigger it; for each benign data race, on average 4 inputs trigger it.

Approach:
1. Profile the interleaving space of each input: look at functions instead of instructions, as only a few functions share memory accesses; then look at all the function pairs that could execute concurrently. If two inputs have similar concurrent function pairs, they are likely to trigger the same bugs.
    Naive way to detect CFPs: look at every instruction in each function and see if they can run concurrently.
    Efficient way: check whether one function's entry point can execute between the other's entry and exit, by looking at locks and barriers.
2. A test plan for selecting inputs and functions.
    Based on the CFP information, use a greedy algorithm to select the inputs that cover the most CFPs (see the sketch after this list).
    Also select which functions to test.
3. Conduct bug detection.
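The greedy selection step reads like classic set cover; here is a minimal sketch of that idea (my own illustration, not the paper's implementation), where each input is scored by how many not-yet-covered CFPs it adds:

```python
def select_inputs(cfps_by_input: dict, budget: int):
    """cfps_by_input maps an input id to the set of concurrent function pairs it
    can exercise; greedily pick up to `budget` inputs by marginal CFP coverage."""
    covered, chosen = set(), []
    while len(chosen) < budget and cfps_by_input:
        best = max(cfps_by_input, key=lambda i: len(cfps_by_input[i] - covered))
        if not (cfps_by_input[best] - covered):
            break                     # nothing new left to cover
        chosen.append(best)
        covered |= cfps_by_input.pop(best)
    return chosen, covered
```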

Result:
A 4x reduction in redundancy, and a corresponding speedup.


Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems
Thanh Do, Haryadi Gunawi

Hardware fails in different ways; what about performance degradation as a failure manifestation?
Facebook reported that one slow NIC in one machine (degraded from 1 Gbps to 1 Kbps) had a cascading impact on a 100-node cluster. And there are many other such stories.

Limpware: hardware whose performance is much lower than its specification.
Study the impact of limpware: (a single piece of limpware has global impact; why?)

Methodology:
Run workload, inject limpware, then probe systems to understand symptoms.

Results:
1. Failures are handled pretty well (by retries etc.), but slowdowns are not!
2. Hadoop's speculative execution is not triggered. (E.g., with a degraded NIC on a map node, all reducers need to fetch data from that node. Speculative execution is not triggered because all reducers are slow! In general, a degraded task leads to a degraded node, which then leads to many degraded nodes.)

Limplock:
The system progresses slowly due to limpware and is not capable of failing over to healthy components.
Level 1: Operation limplock
              caused by single point of failure etc.
Level 2: Node limplock
              caused by multi-purpose queue, unbounded queue, etc
Level 3: Cluster limplock
              
Limplock happens in all the systems they analyzed.

Future work: Limpware-tolerant system
   1. Anticipation
   2. Detection
   3. Recovery

SyNS'13 Session 4: Measure it!

Inaccurate Spectrum Databases?
Tan Zhang, Suman

TV whitespaces are available; how do we use them?

Things people are already doing:
1. Rural internet connections
2. Smart power-grid controller
3. Bus internet

But it shouldn't interfere with primary TV users, hence the need to determine where the whitespaces are.

Wardriving on public vehicles (spectrum sensors deployed on Metro buses)

Accuracy of commercial databases:
1. Good at protecting used spaces
2. But too conservative, thus wasting white spaces.

V-Scope: opportunistic
The wireless gateway uploads data to V-Scope servers, which then use the internet to distribute information to participating databases.

Challenges in measuring the primary signal:
Weak signals could be overwhelmed by strong signals ---> zoom-in peak detection (narrowing the capture bandwidth to reduce the noise floor).
Power ---> peak-based power estimation.
Model refinement ---> improve any propagation model by fitting its parameters to measurements ---> weighted regression fitting to allow good fits even for weak signals (see the sketch below).
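A hedged sketch of what "fit a propagation model with weighted regression" could look like; the log-distance path-loss model and the weighting scheme are my assumptions for illustration, not necessarily what V-Scope does:

```python
import numpy as np

def fit_path_loss(distances_m, rssi_dbm, weights):
    """Weighted least-squares fit of RSSI = P0 - 10*n*log10(d);
    `weights` controls how much each measurement (e.g. weak vs. strong
    signals) influences the fit."""
    d = np.asarray(distances_m, dtype=float)
    y = np.asarray(rssi_dbm, dtype=float)
    W = np.diag(np.asarray(weights, dtype=float))
    X = np.column_stack([np.ones_like(d), -10.0 * np.log10(d)])
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # (X^T W X) beta = X^T W y
    p0, n = beta
    return p0, n       # reference power and path-loss exponent
```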


Enhancing the Quality of Electric Load Forecasting Methods
Scott ALfeld, Paul Barford

Predicting electricity usage: components, buildings, cities, countries; days, years.

This work looks at buildings and cities, not single homes.

Problem:
Many, many buildings make up cities
Easily available: Historical electricity usage
Less so: Everything else (what if you have a new building, a new event?)

Data: hourly electricity usage for 2-3 years, and the NAICS code ("business type")
Task: Predict tomorrow's (hourly) load

Solution:
Latent Feature Extraction
Anomaly Detection
Priority Encoding

Latent Feature Extraction:
Solid but unknown features: number of floors, surface area
Fuzzy and unknown features: sensitivity to humidity, behavior during holidays, ??? (something we didn't think of)

Anomaly Detection:
Individual Anomalies (for one building): boiler upgrade, 50%-off sale
Shared Anomalies (for multiple buildings): three-day weekends, snowstorms

Priority Encoding:
Accuracy may not be the most important thing; the peak is much more important than total usage.
Peak prediction and discontinuous penalties: MSE = (1/n) * Σ_i (t_i - y_i)^2 doesn't work, because errors at the peak matter more for your prediction than errors made during off-peak hours; you also don't want false-positive peaks.
Challenges: you need to align the predicted peak with the actual peak, but not produce a flat line at the peak value! The penalty also needs to be easy to optimize.
We don't have a solution for that yet....


Observing Home Wireless Experience through WiFi APs
Ashish Patro, Suman

A measurement infrastructure that can capture the "wireless experience" in home WLANs.
Questions asked:
1. What is the wireless performance at any given moment?
2. How often does poor performance occur?
3. What are the causes of poor performance?

WiSe infrastructure (wireless infrastructure for in-line sensing)
Data collected:
Signal strength of devices, airtime utilization, latencies, WiFi packet statistics.
Time-correlation analysis
Characterizing wireless performance:
   metric: only passive and coarse local measurements, capture the impact of links, provide app-level analytics
    Input: airtime utilization, total contention, ????

Results:
Causes of poor performance:
In building 1, when performance is poor, 60% of the time there is high airtime utilization and high packet loss, which indicates channel congestion.
In building 2, when performance is poor, latency is high and packet loss is high, indicating a weak signal source.

Interference from microwave ovens:
The interference is short, but present in most buildings.