Wednesday, November 27, 2013

Some Linux tricks

1. Error message "Invalid output format udev. Choose from value, device, list, or full" when installing a kernel
    This happens because the system uses an old version of blkid, which doesn't recognize the newer udev output format.
    To solve this, just install a new util-linux package. Sometimes yum screws up and reports the latest util-linux while blkid is still not updated; just reinstall in that case: yum reinstall util-linux

Monday, November 18, 2013

MIGRATORY COMPRESSION

EMC

Coarse-grained Data Reordering to Improve Compressibility

Compression is limited by its window size; a larger window improves compression but makes it slower.

MC: do coarse-grained reorganization first to group similar blocks together, so that you can improve compressibility with the same window size.

Gzip: Huffman coding, 64KB window
Bzip2: 900KB window, Burrows-Wheeler Transform and Huffman coding
7z: Markov chain-based range codec, 1GB window
(However, this is not an apples-to-apples comparison, as they don't know how to increase the window size of Gzip etc...)

Nearly inverse linear relationship between throughput and compression factor

Idea: improve compression by grouping similar blocks

Burrows-Wheeler: reorders within a block
MC: reorders the blocks themselves

Use cases:
1. compress a single, large file
2. Archival tier of deduplicating storage system
3. network transfers (maybe?)

Delta compression: just store how to transform A into A' in order to store A'. OK for a single file, but too complex for a deduplicating storage system, and not good for locality either.

Workflow of MC:
to compress:
segmentation (partition into blocks, calculate super fingerprints)----> similarity detector   -----> re-organizer
to un-compress:
a header specifies how to reconstruct the file

Similarity detection: similar to "redundancy elimination at the block level", calculating features with a set of rolling hashes.
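
A minimal sketch of that pipeline, assuming fixed-size blocks, a simple polynomial rolling hash, an illustrative feature/super-fingerprint scheme, and zlib standing in for the compressor (none of these are the paper's exact choices):

import hashlib, zlib

BLOCK = 8 * 1024          # block size (illustrative)
WINDOW = 48               # rolling-hash window in bytes
SEEDS = [0x9E3779B97F4A7C15 * (k + 1) & ((1 << 64) - 1) for k in range(4)]   # 4 features/block

def rolling_hashes(block, window=WINDOW):
    """Yield a Rabin-Karp style hash for every window position in the block."""
    base, mod = 257, (1 << 61) - 1
    power = pow(base, window - 1, mod)
    h = 0
    for i, b in enumerate(block):
        if i >= window:
            h = (h - block[i - window] * power) % mod   # drop the byte leaving the window
        h = (h * base + b) % mod                        # bring in the new byte
        if i >= window - 1:
            yield h

def block_features(block):
    """Per-block features: the maximal hash under a few different 'views' of the hash space."""
    hashes = list(rolling_hashes(block)) or [0]
    return tuple(max(h ^ seed for h in hashes) for seed in SEEDS)

def superfingerprint(features):
    return hashlib.sha1(repr(features).encode()).digest()[:8]

def mc_compress(data):
    """Group blocks with matching super-fingerprints next to each other, then compress."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    order = sorted(range(len(blocks)), key=lambda i: superfingerprint(block_features(blocks[i])))
    return order, zlib.compress(b"".join(blocks[i] for i in order), 9)

The returned order is exactly the "recipe" that the decompression header mentioned above would need, so the original block order can be restored afterwards.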

Reorganization: reorganize the input file according to a recipe
1. in-memory: read in the entire file
2. chunk-level: random access
3. multi-pass: in general better than chunk-level

Evaluation:
At least for an in-memory system, adding MC before compression never does worse than vanilla compression. If things don't fit in memory, the gains are not as large because of the random accesses incurred; an SSD helps, though. --> Remzi: can you just limit the re-ordering scope so that it fits into memory?

Decompression has 30% to 70% overhead in an archival system though, due to random access


Q: How about workload-specific compression? If you know the workload better, you may be able to compress better? MC is a pretty generic technique, though?
A. I need to think about it.
     But NetApp's video compression falls into that category...














Friday, November 15, 2013

SyNS'13 Session 6: Security

Resource-freeing Attacks: Improve Your Cloud Performance (at Your Neighbor's Expense)
Venkat Varadarajan, Mike Swift, Tom Ristenpart

Side-channel stuff, taking advantage of CPU scheduling etc.
Solution: better isolation; pack up the VM and move it (but not all workloads are cheap to move).

This work: a greedy customer can get more resources by mounting a resource-freeing attack

Victim: the VM whose resources get freed up
Beneficiary: the VM that gains the freed resources
Helper: mounts the attack; could sit inside the beneficiary

Example:
Cache contention: the victim webserver's frequent interrupts pollute the cache for the beneficiary (because Xen gives higher priority to VMs consuming less CPU time). The helper sends CPU-intensive requests to the webserver, which reduces the interrupts the webserver generates, and the beneficiary's performance improves.

General Resource-freeing attacks:
1. Send targeted requests to the victim
2. Shift the victim's resource usage away from the bottleneck

OpenNF: All your network functions are belong to us (almost)
Aaron Gember, Aditya Akella

NFs (or middleboxes) examine/modify packets at layers 3-7, and we see increasing deployment of software NFs.

But today NFs are treated as black boxes, which makes it hard to identify bottlenecks etc.

Example:
If you scale up, you need to move flows, otherwise the bottleneck may persist or you delay your scale-up. But simply moving a flow may lead to inconsistency. So move flows along with the internal state of the middleboxes.

Solution: OpenNF
APIs implemented by NFs so that applications can examine/update a middlebox's internal state.
Problem with VM replication: unneeded state may cause incorrect actions, and state cannot be merged, making complex load balancing impossible.

NF state taxonomy:
Per-flow state, multi-flow state, all-flow state (statistics)

Per-flow state we can just move around; multi-flow state we share (but synchronization is expensive, so clone and lazily merge only when scaling down); all-flow state is handled the same way as multi-flow state.

State consistency: (needed because the network could keep updating state while we are moving things)
Suspending the flow's traffic and then moving is one solution.
Packet redirect events are a more efficient solution (it is critical that you move the state before you update the network routing).
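
A toy sketch of such a loss-free move, with made-up names (Controller, export_state/import_state) rather than the real OpenNF API: buffer the flow's packets, copy the per-flow state, update routing, then replay the buffer at the destination.

class NF:
    """A toy network function holding only per-flow state, indexed by a flow key."""
    def __init__(self):
        self.per_flow = {}                       # flow_key -> state blob

    def export_state(self, flow_key):
        return self.per_flow.pop(flow_key, None)

    def import_state(self, flow_key, state):
        if state is not None:
            self.per_flow[flow_key] = state

    def process(self, flow_key, pkt):
        self.per_flow.setdefault(flow_key, {"pkts": 0})["pkts"] += 1

class Controller:
    """A toy controller that can buffer a flow's packets while a move is in flight."""
    def __init__(self):
        self.routes, self.buffers = {}, {}

    def start_buffering(self, flow_key):
        self.buffers[flow_key] = []
        return self.buffers[flow_key]

    def deliver(self, flow_key, pkt):
        if flow_key in self.buffers:             # move in progress: hold the packet
            self.buffers[flow_key].append(pkt)
        else:
            self.routes[flow_key].process(flow_key, pkt)

    def update_route(self, flow_key, nf):
        self.routes[flow_key] = nf

    def stop_buffering(self, flow_key):
        self.buffers.pop(flow_key, None)

def move_flow(controller, src_nf, dst_nf, flow_key):
    buffered = controller.start_buffering(flow_key)                  # 1. stop state updates at src
    dst_nf.import_state(flow_key, src_nf.export_state(flow_key))     # 2. move the state first
    controller.update_route(flow_key, dst_nf)                        # 3. only then re-route the flow
    for pkt in buffered:                                             # 4. replay what arrived meanwhile
        dst_nf.process(flow_key, pkt)
    controller.stop_buffering(flow_key)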


RiskRoute: A Framework for Mitigating Network Outage Threats
Ramakrishnan Durairajan, Paul Barford

Adjust Internet routing in real time, before outages (caused by natural disasters) happen.

Does Internet routing currently take advantage of the predictability of natural disasters?
Yes, but it is done by hand today, and is thus incomplete and inefficient.

Bit-Risk Miles metric:
Assesses sensitivity to network outages. It combines the number of bits sent, the distance they travel, and the outage risk into a single "bit-risk miles" value.

RiskRoute methodology:
An optimization problem which minimizes the bit-risk miles.

Evaluation:
1. Risk ratio: average reduction in bit-risk miles
2. Distance ratio: average increase in bit-miles.





SyNS'13 Session 5: Bugs

Toward Efficient, Portable Application-Level Crash Consistency
Thanu, Remzi

Many techniques exist for file system consistency (for metadata), but what about application data consistency (for user data)?

How do applications achieve consistency? They rely on specific details of file system implementations.

Application-level invariants: thumbnails match pictures, etc. But these are hard to maintain across system crashes.

Example:
Atomic file update: create a temp file, fsync() the temp file to disk, then use rename(). The fsync() may be left out because of a wrong understanding, for performance's sake, or because on most file systems (e.g., ext4) it happens to be correct anyway.
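
A small sketch of that atomic-update idiom; the two fsync() calls, especially the one on the parent directory, are exactly the steps applications tend to skip:

import os

def atomic_update(path, data):
    """Atomically replace the contents of `path` with `data` (bytes) via temp file + rename."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                      # make the temp file's contents durable first
    finally:
        os.close(fd)
    os.rename(tmp, path)                  # atomically swap in the new version
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)                     # persist the rename (the directory entry) itself
    finally:
        os.close(dfd)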

Want to study application consistency behavior.
Methodology: case studies of SQLite and LevelDB.

Properties offered by file systems (inspired by the disk properties SQLite relies on, I think):
1. Post-crash property (true/false): does a system call sequence only result in a given, desirable set of post-crash states? ext3-ordered: yes, ext3-writeback: yes, ext4-ordered: no, btrfs: yes
2. Ordered appends (true/false): ext3-ordered: yes, ext3-writeback: no, ext4-ordered: no, btrfs: no
3. Ordered dir-ops: directory operations get persisted in issued order.
4. Safe appends: when a file is appended, the appended portion will never contain garbage.
5. Safe new file: after fsync() on a new file, another fsync() on the parent directory is not necessary.

Bugs in SQLite and LevelDB: all of them are only exposed on certain file systems. And the developers said that they don't understand file system behavior, so they had to be conservative in their implementations, which hurts performance.

Performance:
Experiments show a 3x performance boost if you can rely on certain file system properties to optimize application behavior!

Future work:
Solutions which don't require application re-writing; system call re-ordering, maybe? Tools to detect such bugs?


Efficient Concurrency-Bug Detection Across Inputs
Dongdong Deng, Shan Lu

Concurrency-bug detection is costly: it needs to take many inputs, and each input has a huge interleaving space to test. Software companies cannot afford exhaustive in-house testing.

Idea:
Remove redundancy across inputs, because multiple inputs can trigger the same bug. How can we remove such overlaps?

Solution:
Find the overlapping interleaving space for different inputs, given a set of inputs.
Which two inputs will give the same interleaving space? We cannot answer this question perfectly, but we can estimate it. To this end, the concurrent function pairs (CFP) metric is proposed.

Characteristic study:
For each harmful data race, on average 7 inputs trigger it; for each benign data race, on average 4 inputs trigger it.

Approach:
1. Profile the interleaving space of each input: look at functions instead of instructions, as only a few functions share memory accesses; then look at all the function pairs which could be executed concurrently. If two inputs have similar concurrent function pairs, they are likely to trigger the same bugs.
    Naive way to detect CFPs: look at every instruction in each function and see if they can run concurrently.
    Efficient way: check whether one function's entry point can execute between the other's entry and exit, by looking at locks and barriers.
2. Build a test plan by selecting inputs and functions.
    Based on the CFP information, use a greedy algorithm to select the inputs which cover the most CFPs (see the sketch after this list).
    Also select the functions which we test.
3. Conduct bug detection.
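
A minimal sketch of the greedy selection in step 2, assuming the per-input CFP sets from step 1 are already available (function selection and the actual detection run are not shown):

def select_inputs(cfp_by_input, budget=None):
    """Greedy set cover: repeatedly pick the input covering the most not-yet-covered CFPs."""
    covered, chosen = set(), []
    remaining = dict(cfp_by_input)                 # input id -> set of concurrent function pairs
    while remaining and (budget is None or len(chosen) < budget):
        best = max(remaining, key=lambda i: len(remaining[i] - covered))
        gain = remaining[best] - covered
        if not gain:                               # every remaining input is redundant
            break
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered

# example: in2 exercises nothing that in1 does not already cover, so it is skipped
inputs = {
    "in1": {("put", "get"), ("put", "resize")},
    "in2": {("put", "get")},
    "in3": {("delete", "resize")},
}
print(select_inputs(inputs))                       # (['in1', 'in3'], ...)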

Result:
4x redundancy reduction, and a corresponding speedup.


Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems
Thanh Do, Haryadi Gunawi

Hardware fails in different ways; what about performance degradation as a failure manifestation?
Facebook reported that one slow NIC on one machine (degraded from 1 Gbps to 1 kbps) had a cascading impact on a 100-node cluster. And there are many other such stories.

Limpware: hardware whose performance is much lower than its specification.
Study the impact of limpware: (a single piece of limpware has global impact, why?)

Methodology:
Run a workload, inject limpware, then probe the systems to understand the symptoms.

Results:
1. Failures are handled pretty well (by retries etc.), but slowdowns are not!
2. Hadoop's speculative execution is not triggered. (E.g., with a degraded NIC on a map node, all reducers need to fetch data from that node. Speculative execution is not triggered because all reducers are slow! In general, a degraded task leads to a degraded node, which then leads to many degraded nodes.)

Limplock:
 The system progresses slowly due to limpware and is not capable of failing over to healthy components.
Level 1: Operation limplock
              caused by single points of failure etc.
Level 2: Node limplock
              caused by multi-purpose queues, unbounded queues, etc.
Level 3: Cluster limplock
              
Limplock happens in all the systems they analyzed.

Future work: Limpware-tolerant system
   1. Anticipation
   2. Detection
   3. Recovery

SyNS'13 Session 4: Measure it!

Inaccurate Spectrum Databases?
Tan Zhang, Suman

TV white spaces are available; how do we use them?

Things people are already doing:
1. Rural internet connectivity
2. Smart power-grid controller
3. Bus internet

But they shouldn't interfere with primary TV users, hence the need to determine white spaces accurately.

Wardriving on public vehicles (spectrum sensors deployed on Metro buses)

Accuracy of Commercial Databases:
1. Good for protecting used spaces
2. But too conservative, thus wasting white spaces.

V-Scope: opportunistic
Wireless gateways upload data to V-Scope servers; the server then uses the internet to distribute information to the participating databases.

Challenges in measuring the primary signal:
Weak signals could be overwhelmed by strong signals ---> zoom-in peak detection (narrowing the capture bandwidth to reduce the noise floor).
Power ---> peak-based power estimation.
Model refinement ---> improve any propagation model by fitting its parameters to measurements ---> weighted regression fitting, so that even weak signals are fit well.
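
A rough sketch of the weighted fitting, assuming a simple log-distance path-loss model RSS = p0 - 10*n*log10(d) and an illustrative weighting that favors weak (more negative dBm) measurements; the actual model and weights in V-Scope may differ.

import numpy as np

def fit_pathloss(dist_m, rss_dbm):
    """Weighted least-squares fit of RSS = p0 - 10*n*log10(d); returns (p0, n)."""
    d = np.asarray(dist_m, dtype=float)
    y = np.asarray(rss_dbm, dtype=float)
    X = np.column_stack([np.ones_like(d), -10.0 * np.log10(d)])   # columns for p0 and n
    w = np.maximum(-y, 1.0)              # weaker signals (more negative dBm) get larger weight
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta[0], beta[1]

def predict_rss(p0, n, d):
    return p0 - 10.0 * n * np.log10(np.asarray(d, dtype=float))

# toy usage with made-up drive-by measurements
p0, n = fit_pathloss([100, 500, 2000, 8000], [-50, -63, -75, -92])
print(p0, n, predict_rss(p0, n, 4000))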


Enhancing the Quality of Electric Load Forecasting Methods
Scott Alfeld, Paul Barford

Predicting electricity usage: for components, buildings, cities, countries; over days or years.

This work looks at buildings and cities, not single homes.

Problem:
Many, many buildings make up a city.
Easily available: historical electricity usage.
Less so: everything else (what if you have a new building, or a new event?)

Data: hourly electricity usage for 2-3 years, plus the NAICS code ("business type")
Task: Predict tomorrow's (hourly) load

Solution:
Latent Feature Extraction
Anomaly Detection
Priority Encoding

Latent Feature Extraction:
Solid but unknown features: number of floors, surface area
Fuzzy and unknown features: sensitivity to humidity, behavior during holidays, ??? (something we didn't think of)

Anomaly Detection:
Individual Anomalies (for one building): boiler upgrade, 50%-off sale
Shared Anomalies (for multiple buildings): three-day weekends, snowstorms

Priority Encoding:
Accuracy may not be the most important thing; the peak is much more important than total usage.
Peak prediction and discontinuous penalties: MSE = (1/n) * sum_i (t_i - y_i)^2 doesn't work, because errors at the peak matter more for your prediction than errors made during off-peak hours; you also don't want false-positive peaks.
Challenges: the penalty needs to align the predicted peak with the actual peak, but it should not just be a straight line at the peak value! It also needs to be easy to optimize.
We don't have a solution for that yet...


Observing Home Wireless Experience through WIFI APs
Ashish Patro, Suman

A measurement infrastructure that can capture the "wireless experience" in home WLANs.
Questions asked:
1. What is the wireless performance at any given moment?
2. How often does poor performance occur?
3. What are the causes of poor performance?

Wise infrastructure (wireless infrastructure for in-line sensing)
Data collected:
Signal strength of devices, airtime utilization, latencies, WiFi packet statistics.
Time correlation analysis
Characterizing wireless performance:
   metric: uses only passive and coarse local measurements, captures the impact of links, provides app-agnostic analysis
    Input: airtime utilization, total contention, ????

Results:
Cause of poor performance
in building 1, when performance is poor, 60% of the time we see high airtime utilization and high packet loss, indicating channel congestion
in building 2, when performance is poor, latency is high and packet loss is high, indicating a weak signal source

Interference from microwave ovens:
The interference is short, but present in most buildings.

















New Encryption Primitives for Uncertain Times

Encryption is one of the few things that you can rely on.

But existing encryption primitives aren't always helping:
DPI can block protocols (like Tor) --> encryption tools are easy to detect --> format-transforming encryption
Dropbox has access to your data --> encryption doesn't allow de-duplication to save space --> message-locked encryption
Password-based encryption often gets bad passwords --> decryption yields garbage for a wrong key but meaningful text for the correct key, which is good for brute-force attacks --> honey encryption

1. Encryption doesn't try to hide its presence
    Steganography: embed bits into HTTP messages; but too slow
    Obfsproxy: encrypt all bits sent over the network, but DPI will flag the traffic as "????"
    Want: trick DPI into thinking it's HTTP traffic
     Can we build encryption which can fool regular expression matching? ---> Format-transforming encryption: take the cleartext and a regular expression R, and the ciphertext is guaranteed to match R.
      Mechanism: ranking for DFAs, but now for regexes (exponential complexity, but it's OK here); see the sketch below
      Can DPI adapt to detect FTE?
       1. Use R1, R2 against FTE --> false positives
       2. Find an R that matches FTE but not legitimate traffic --> it is fast to change FTE formats
       3. Find non-regular checks ---> speed?
       4. Custom fight-back? ----> ???

     FTE beats China's blocking of Tor, but the Tor folks are skeptical about deploying an "academic" product (as if they weren't an academic product themselves....)
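
A toy sketch of the ranking/unranking primitive that FTE builds on, for a fixed string length and a hand-written DFA: count the accepted strings of that length, then map an integer (in FTE, derived from the ciphertext bits) to the lexicographically i-th accepted string. The real system compiles regexes to DFAs and handles variable lengths.

def count_table(dfa, length):
    """cnt[l][q] = number of length-l strings accepted when starting from state q."""
    states, delta, accept, alphabet = dfa["states"], dfa["delta"], dfa["accept"], dfa["alphabet"]
    cnt = [{q: (1 if q in accept else 0) for q in states}]
    for _ in range(length):
        prev = cnt[-1]
        cnt.append({q: sum(prev[delta[(q, a)]] for a in alphabet) for q in states})
    return cnt

def unrank(dfa, length, index):
    """Return the index-th (0-based, lexicographic) length-`length` string accepted by the DFA."""
    cnt = count_table(dfa, length)
    q, out = dfa["start"], []
    for remaining in range(length, 0, -1):
        for a in sorted(dfa["alphabet"]):
            nxt = dfa["delta"][(q, a)]
            below = cnt[remaining - 1][nxt]
            if index < below:
                out.append(a)
                q = nxt
                break
            index -= below
        else:
            raise ValueError("index out of range for this length")
    return "".join(out)

# toy DFA for (ab)* over {a, b}; state 2 is a dead state
dfa = {
    "alphabet": "ab",
    "states": {0, 1, 2},
    "start": 0,
    "accept": {0},
    "delta": {(0, "a"): 1, (0, "b"): 2, (1, "a"): 2, (1, "b"): 0,
              (2, "a"): 2, (2, "b"): 2},
}
print(unrank(dfa, 6, 0))    # 'ababab' -- the only length-6 string in (ab)*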

2. De-duplication doesn't work with conventional client-side encryption
    Prior work (in the distributed systems literature, surprisingly): convergent encryption (CE)
     Their solution:
        M --> KeyGen(M) ---> K, which is then used for encryption: Enc(K, M) --> C
     Big idea: the same message always encrypts to the same ciphertext, so encrypted copies can still be deduplicated
      But it has a brute-force attack problem.
       DupLESS is built to overcome the brute-force attack issue of MLE

   Q: If we store the same data, then I know that you know (that I know...)?
   A: Yes, but this is an orthogonal (side-channel) issue to what we are looking at.
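
A toy sketch of convergent encryption using only the standard library (a SHA-256 keystream stands in for a real cipher such as AES-CTR): the key and nonce are derived from the message itself, so identical plaintexts give identical ciphertexts and the server can deduplicate without reading the data. That same determinism is what enables the brute-force guessing attack DupLESS is built to address.

import hashlib

def _keystream(key, nonce, length):
    """Toy deterministic keystream -- a stand-in for a real cipher, not for production use."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def ce_encrypt(message):
    key = hashlib.sha256(message).digest()          # K = H(M): derived from the message itself
    nonce = hashlib.sha256(key).digest()[:16]       # deterministic, also message-derived
    cipher = bytes(m ^ k for m, k in zip(message, _keystream(key, nonce, len(message))))
    return key, cipher

def ce_decrypt(key, cipher):
    nonce = hashlib.sha256(key).digest()[:16]
    return bytes(c ^ k for c, k in zip(cipher, _keystream(key, nonce, len(cipher))))

# two clients storing the same file produce the same ciphertext -> server-side dedup works
k1, c1 = ce_encrypt(b"same file contents")
k2, c2 = ce_encrypt(b"same file contents")
assert c1 == c2 and ce_decrypt(k1, c1) == b"same file contents"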


3. Password-based encryption -> people don't pick good passwords, so you need to assume the password is drawn from a small set
  
message + secret key K ---> ciphertext C. A brute-force attack is possible by iterating over K (only one decrypted text is not garbage).
Prior solution: the PKCS#5 standard, which does key derivation with lots of hashing, so the attacker has to do the hashing work for every single guess; but it is slow, and it only slows down the previous attack by a constant factor.
Idea: increase the difficulty of identifying the right key by producing plausible output when you decrypt with a wrong key.
Honey encryption: same API as password-based encryption schemes, but uses special encodings to ensure that decrypting a ciphertext with a *wrong* key yields a fresh sample from some target distribution (like a plausible distribution over natural language).
Honey encryption for prime numbers: P (a prime number) + K ---> C. The attacker can just try decrypting with each key and look for prime numbers. Now you want an encoding such that every decryption of C yields a prime number.
           P --> (distribution-transforming encoder) ---> uniform bit-string ----> conventional encryption with K --> C
           Attacker: C --> decrypt with K' ---> fresh uniform bit-string ---> distribution-transforming decoder ---> P'
      No attacker A can recover the message with probability better than 1/q (where q is the size of the set the key/password is drawn from), no matter how long the attacker runs.
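
A toy sketch of the DTE idea for the prime-number example, assuming the message space is a small fixed table of primes and using a trivial key-derived pad as the "cipher": decrypting with any wrong key decodes to some other prime, not to garbage.

import hashlib

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]    # toy message space

def _pad(key):
    # toy "cipher": a pseudorandom shift derived from the (low-entropy) key
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:2], "big") % len(PRIMES)

def he_encrypt(prime, key):
    idx = PRIMES.index(prime)                  # DTE encode: prime -> index, uniform over the table
    return (idx + _pad(key)) % len(PRIMES)     # "encrypt" the index

def he_decrypt(cipher, key):
    idx = (cipher - _pad(key)) % len(PRIMES)   # decrypt the index...
    return PRIMES[idx]                         # ...and DTE-decode: always lands on *some* prime

c = he_encrypt(13, "correct horse")
print(he_decrypt(c, "correct horse"))    # 13
print(he_decrypt(c, "wrong guess"))      # a different but plausible prime -- not garbage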

Hopefully next year we will have a general encryption scheme which covers all three of the above (because they are based on similar ideas anyway).

Q: Have you looked at other real-world security problems that you think could use an interesting encryption scheme, but you don't know what that encryption scheme is yet?
A: Yes, all the time. Like Gmail encryption: you lose spam detection, and we don't know how to solve that. In general, if you want data to be safe, but you also want a third party to do something with your data (de-duplication, spam detection), you have a problem.

SyNS'13 session 3: Fast & Safe

How is My Driving: Sensing Driving Behaviors by Android Devices
Lei Kang, Suman

Snapshot: a hardware solution to count driving miles and hard brakes (speed decrease in mph per second). 30% discount for "good drivers".

No-special-hardware approach:
a rating system based on phone sensors (accelerometer and gyroscope for acceleration, braking, turns, lane changes)
Also keep track of traffic and road conditions.

Challenges:
  1. People put their phones in different locations
      Solution: coordinate projection, to translate the phone's coordinate system, in any rotation, into the earth coordinate system (see the sketch after this list).

  2. Hard to define good or bad behavior
      Solution: compare the system's rating with passenger ratings
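
A rough sketch of the projection step, assuming the gravity direction can be estimated from the accelerometer and ignoring the compass/yaw part: rotate every phone-frame sample so that gravity maps onto the earth-frame z axis (Rodrigues' rotation formula).

import numpy as np

def rotation_to_earth(gravity_phone):
    """Rotation matrix taking the phone-frame gravity direction to (0, 0, 1) in the earth frame."""
    a = np.asarray(gravity_phone, float)
    a = a / np.linalg.norm(a)
    b = np.array([0.0, 0.0, 1.0])
    v, c = np.cross(a, b), float(np.dot(a, b))
    if np.isclose(c, -1.0):                  # phone exactly upside down: rotate 180 deg about x
        return np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def project_to_earth(samples_phone, gravity_phone):
    """Project raw accelerometer samples into the earth frame (z = up/down)."""
    R = rotation_to_earth(gravity_phone)
    return np.asarray(samples_phone, float) @ R.T

# toy usage: gravity estimated while the phone sits tilted in a cup holder
earth = project_to_earth([[0.3, 9.6, 1.5], [1.1, 9.4, 1.6]], [0.3, 9.6, 1.5])
print(earth)     # the first row comes out as roughly [0, 0, 9.72]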

Related work:
Driver phone detection using the car's speakers (MobiCom'11)
Driver phone detection using different centripetal-force readings on the accelerometer (MobiSys'13)
Distracted-driver detection using the mobile phone's camera (MobiSys'13)

Q: Would immediate feedback be useful?
A: That would be real-time detection of driving behavior. What we are doing is an offline study, but an online version wouldn't be difficult.


Rinnegan: Efficient Resource Use for Heterogeneous Processors
Sankar, Mike Swift

Task scheduling: a task can run on different processing units (a parallel kernel on GPU or CPU, encryption on the CPU or an AES accelerator). It is not always good to use the GPU, because of contention and data-transfer overhead.
Power is limited: not all units can run at the highest performance at once. We cap power usage pretty well now, but we don't distribute power smartly across the different units.

Rinnegan:
 Adaptive task scheduling, power awareness.
1. Adaptivity is achieved by the OpenKernel, which exposes resource utilization and allows applications to make task-placement decisions. Don't do it transparently in the kernel, because the application knows best, and accelerators can be accessed directly by apps.
    Implemented by an accelerator stub, which predicts utilization; an accelerator agent, which implements scheduling decisions, enforces power limits, and exposes utilization information; and an accelerator monitor, which publishes the information exposed by agents to applications.
2. Power abstraction
  Power credits are maintained by a central power center; a power agent calculates a power model and asks for power credits.




SyNS'13 session 2: Storage

Storage Class Memory Needs Flexible Interfaces
Sanketh Nalli, Michael Swift

Different SCM:
1. Phase-change memory (PCM) -- reads as fast as memory, writes as fast as flash
2. Spin-transfer torque MRAM
3. Flash-backed DRAM
4. ???

Today SCM relies on the file system because the FS provides:
1. global naming
2. protected sharing (delete open files, etc.)
3. crash consistency
Some systems use direct mapping for data, but go through the FS for metadata

But name-space operations matter!

Causes of problems:
1. Trapping into the OS --- everything is a file
2. The fixed interface exposed by the file system

What if we had a POSIX FS for legacy apps, and a custom FS for new apps?

Solution: Aerie (a user-mode file system which provides flexible interfaces)

SCM Manager: allocation, protection, addressing
LibFS: namespace implementation
Trusted Service: integrity and consistency

Two file systems were implemented: one POSIX-like, one key-value-store-like



Fault Isolation and Quick Recovery in Isolation File Systems
Lanyue Lu, Remzi

A small fault can cause a global failure -- all files on a file system share a single fault domain.

Solution:
A new abstraction of fault domains. (Isolation FS)

File system failure behaviour study:
1. What global failures are used?
     Read-only and crash. Over 67% of global failures crash the file system.
2. Causes of global failures?
     Metadata corruption, I/O failures, or software bugs (a null pointer, say). All global failures are caused by system state, not user data.

Current FS abstractions:
File and directory, namespace, partition (a panic still takes down the whole OS, and partitions are inefficient)

Isolation File Systems:
  File pod: an abstract partition, which contains a group of files and their "related metadata"
   1. Isolate metadata for each file pod
   2. Convert global failures to local failures.
Implementation on ext3:
   1. A file pod contains multiple block groups. A pod id is inserted into each group descriptor.
   2. Defragmentation for groups.
   3. Journal isolation: a virtual journal contains updates from only one pod

Q: Is there a case where you want global failures?


Optimistic Crash Consistency:
Vijay, Remzi

File systems conflate ordering and durability, which makes consistency expensive.
They provide osync(), which provides ordering without durability; this leads to a high-performance file system which provides consistency as strong as ext4, just with weaker timeliness.

Techniques:
1. Disk notifications when a block hits the platter surface
2. Transaction checksums to detect re-ordering of journal blocks
3. Some other techniques

Insights:
More and more layers sit between the app and the storage. We need interfaces that provide freedom for each layer but still preserve correctness.



SyNS'13 Session 1: Cloud

Next Stop, the Cloud: Understanding Modern Web Service Deployment in EC2 and Azure
Keqiang He, Aditya Akella

A global, in-depth understanding of IaaS cloud use. Questions:
1. Who is using public clouds? (cloud-using domains, traffic profile)
2. How are web services using the cloud (impact of failures etc.)?

Datasets: university packet captures and Alexa subdomain DNS records

Domains: 4% of domains use the cloud; among those, 90% use Amazon's cloud.
Traffic volume: 81.7% goes to Amazon, and 18.3% to Windows Azure.
Front end: 8% use a PaaS front end, 4% a load-balancer front end, 72% a VM front end.
Regions: 97% of the domains use just one region ---> a single-region failure can take down a large fraction of cloud-using services.
Availability zones: 33% of domains use one availability zone, 40% use two.

Highlights:
1. 4% of popular services use EC2/Azure
2. VMs are the most popular EC2 front end; value-added features are used by top domains
3. Limited zone and region usage.

Question:
Q: Dropbox dominates so much. If Dropbox were taken out of the picture, would the results still be the same?
A: For the flow analysis, I don't know. For the domain analysis, it would be the same.



Viewbox: Integrating Local File Systems with Cloud Storage Services
Yupu Zhang, Remzi Arpaci-Dusseau

Is your data really safe? Not really, in Dropbox, even if you have many copies...
Corruption propagates to the other copies.
Crash inconsistencies end up everywhere.
"Out-of-sync" synchronization: client crashes are not handled correctly.
Causal inconsistency: files are uploaded out of order, and the cloud state may not match any FS epoch.

Why:
A semantic gap between file systems and cloud sync services; they are designed separately.

Viewbox: view-based synchronization
Views: in-memory snapshots of the synchronization folder, created at FS epochs (e.g., at journal commit time)
One active view (to absorb FS changes), one frozen view (to sync to the client)

Components of Viewbox:
1. ext4-checksum ---> corruption detection and inconsistency detection (e.g., a transaction that didn't commit)
2. ViewFS (an in-memory extension to ext4-checksum. It keeps an operation log of all the metadata, so it can later apply it to the frozen view?)
3. Cloud-aided recovery.

Stratos: A Network-Aware Orchestration Layer for Virtual Middleboxes in the Cloud
Anand, Aditya

(This is the middlebox scaling Aditya was talking about last year)
State of the art: in order to steer traffic to middleboxes, tenants need to build an overlay network on top of the cloud network

Stratos:
Forwarding controller: translates policies over middlebox chains into forwarding rules.
    1. Split chains into subchains at mangling/connection-terminating middleboxes
    2. Calculate forwarding rules.
Data plane
Resource controller:
     Monitors the end-to-end performance of chains, then decides the resources for each chain. (Uses flow distribution, horizontal scaling, instance migration, and scaling up a middlebox, in order from lighter-weight to heavier-weight)






Wednesday, November 13, 2013

Verification games: Making software verification fun

Michael Ernst, U Washington

Software engineering:
   mathematics (modeling, analysis), but there is a non-ideal component: people!

Crowd-sourced software verification.

Programming is like a game. But when is it not fun?

Code ---> automatically translated ---> Game ---> People playing ---> Completed game ---> Automatically translated ---> Verified software (with proofs/annotations)

Idea:
Encode a constraint system for both the code and the game.


Example:
Code: an assignment can only go from a type to its supertype. (variable: ball, assignment: pipe)
Code: null pointer errors (small ball: non-null value, big ball: null value, pipe: dereference)
Game: a ball can only fall into pipes wider than the ball
Cheats: cut the ball's size to win an unwinnable game (bug detected!)
Most accurate intuition: type constraints
Different pipes of the same color on different boards: different uses of the same variable
Different boards: different methods of a class
Level: a class
You need to solve all boards of one level at once. (Non-local reasoning)

Type flow vs. data flow:
Good for type flow (no loops, specific values, etc.), not good for data flow

Other examples:
SQL injection
race conditions and deadlocks (if you use locks)
unintended side effects

Why type systems for verification are good for games:
Modularity, local reasoning, and no need for players to understand the code

3-way collaboration: machines, players, verification experts
Machine optimization: simplify the challenge to its essence (abstract interpretation, constraint propagation, heuristic solving) --- to make the game high-level and playable. Too much detail: players get distracted; too little detail: unable to produce a useful result.
Players: utilize a community (20,000 person-years were spent playing Angry Birds as of 2011!), a game-design point. The same problem can have different skins --- different games.

Game requirements: abstraction (easy playing), modularity (different people work on different parts of the program)

Scoring system: guides people to do more effective verification. In some sense, designing a good scoring system is the whole problem.


Why not convert everything into SAT and gamify SAT:
The size explodes and the problem structure is destroyed, so there is no human intuition left

Big questions:
How do we gamify software engineering, or even computer tasks in general?


SOSP'13 papers

Some useful links:
SOSP program page: http://sigops.org/sosp/sosp13/program.html (paper, slides, online questions and talks)
syslog: live from SOSP 2013: http://www.syslog.cl.cam.ac.uk/2013/11/04/live-blog-from-sosp-2013/




Tango: Distributed Data Structures over a Shared Log
Microsoft Research and Microsoft
Presented by Vijay

This is built on the same principle as a log-structured file system, but distributed. It is built on SSD arrays, so random reads are fast (necessary for a log structure).

They keep a global sequence number to achieve serializability. However, writes from different clients with different sequence numbers land at different locations in the log, and thus can happen concurrently.
Transactions are implemented in an optimistic fashion:
A transaction starts with a transaction-begin block in the shared log, followed by a read block which records the version numbers of all the objects this transaction is going to read (and thus depends on), then followed by whatever updates the transaction made. Every client can then commit this transaction if the versions of the objects in the read block haven't changed, without any coordination. If an object in the read block has been updated, the transaction just aborts and restarts.
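
A minimal sketch of that commit rule, with an in-process list standing in for the shared SSD-backed log and per-object version counters standing in for log positions; the real Tango API is of course richer.

class SharedLog:
    """Toy shared log: clients append records and everyone replays them in order."""
    def __init__(self):
        self.records = []                 # each record: {"reads": {obj: version}, "writes": {...}}

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1      # the global sequence number

class TangoView:
    """One client's view of the objects, built deterministically by replaying the log."""
    def __init__(self, log):
        self.log, self.applied = log, 0
        self.values, self.versions = {}, {}

    def sync(self):
        """Replay new entries; a transaction commits only if its read versions still match."""
        while self.applied < len(self.log.records):
            rec = self.log.records[self.applied]
            if all(self.versions.get(k, 0) == v for k, v in rec["reads"].items()):
                for k, val in rec["writes"].items():        # commit: apply the writes
                    self.values[k] = val
                    self.versions[k] = self.versions.get(k, 0) + 1
            # else: abort -- every client skips the record in exactly the same way
            self.applied += 1

# two clients race to update "x"; both read version 0, only the first append commits
log = SharedLog()
log.append({"reads": {"x": 0}, "writes": {"x": 1}})
log.append({"reads": {"x": 0}, "writes": {"x": 2}})    # conflicting, aborted on replay
view = TangoView(log)
view.sync()
print(view.values)    # {'x': 1}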

They claim that garbage collection is not a big problem, because flash cards can do thousands of operations per second even with garbage collection going on.

This makes it easier to build distributed data services. They have built ZooKeeper in ~1000 lines of code on top of this. It is also likely to have higher performance, because in a traditional distributed consistency protocol the communication overhead is high with many participating nodes, so the good performance provided by SSDs may not be fully utilized.



The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors (best paper)
Robert Morris, MIT CSAIL

Motivation:
Instead of relying on running workloads to find scalability bottlenecks, we should look at interface design and its implementation.

Proved by construction, via a simulation argument:
Whenever interface operations commute, they can be implemented in a way that scales.
      e.g., create() returning the lowest fd doesn't commute, but create() returning any fd does commute
      Intuition behind it: if operations commute, then the results are independent of order, so you can implement them in a communication-free way, which scales well.

Commutativity is sensitive to operations, arguments, and state. (Thus it is more or less a dynamic property...)

Commuter:
Takes a specification (e.g., POSIX, in the form of a symbolic model) and an implementation (e.g., Linux) as input, and outputs scalability bottlenecks. It finds the bottlenecks by first checking the interface and analyzing all commutativity conditions (e.g., for rename, two calls commute if both source files exist and the names are not the same, or neither exists, or both are self-renames, etc.), then exhaustively generating test cases for all possible pairs of cases and checking whether they are conflict-free (i.e., do not access the same state), i.e., commute, and outputting these commute tests. Finally, it runs these test cases on an implementation (say, Linux) and checks whether, for the commutative ops, Linux generates any shared state, making them not scale.
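
A tiny sketch of the underlying test idea (not Commuter itself, which works symbolically over a POSIX model): run a pair of operations in both orders from the same start state and check whether each operation's result and the final state are independent of the order.

import copy

def commutes(state, op_a, op_b):
    """True iff op_a and op_b commute on this concrete state (results and final state agree)."""
    s1 = copy.deepcopy(state)
    ra1, rb1 = op_a(s1), op_b(s1)          # order: a then b
    s2 = copy.deepcopy(state)
    rb2, ra2 = op_b(s2), op_a(s2)          # order: b then a
    return (ra1, rb1) == (ra2, rb2) and s1 == s2

# toy fd table: create() returning the *lowest* free fd does not commute with itself --
# exactly the POSIX example above -- while writes to different names do commute
def create_lowest(fds):
    fd = min(set(range(1024)) - fds)
    fds.add(fd)
    return fd

print(commutes(set(), create_lowest, create_lowest))                    # False
print(commutes({}, lambda s: s.update(a=1), lambda s: s.update(b=2)))   # True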

Results:
They ran Commuter on Linux and found that 68% of the cases are not conflict-free. They then built sv6, removing most of the bottlenecks.

My comments:
1. How useful this tool is actually depends on the richness of the interface specification you provide. And here "interface" is used as a broad term: it also includes the underlying state the operations might touch (say, inodes for renames), which you could argue is part of the implementation... The trick is to give it enough detail that it doesn't report everything as scalable, but not so much detail that you are effectively specifying implementation details as the interface.
2. The title says "interface", yet they didn't seem to pinpoint a place where the POSIX interface is not scalable. (Well, maybe a few small places, like create() returning the lowest fd instead of any fd.) In general, POSIX is pretty good in terms of scalability.
3. The graph they provide (Figure 6 in the paper) is quite interesting. It reports conflicts, but doesn't say how frequently those conflicts happen in the real world (say, I might have a global lock on a rare operation). But I guess you could argue that Commuter just provides information, and it is up to you to decide how important the conflicts it finds are.
In general, this is a pretty novel paper and it certainly changed the way I think about scalability.


Toward Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior
Armando Solar-Lezama, MIT CSAIL

This is actually a PL-related paper done by pure systems people.

High level idea:
Undefined behavior in the language causes problems because the programmer and the compiler operate under different assumptions.

Approach:
Since compilers usually optimize away unstable code (code which invokes undefined behavior), run the analysis (STACK) twice: once under the assumption that undefined behavior may be invoked, and once under the assumption that undefined behavior is never invoked. Compare the delta to find where the compiler discards code, and thus find unstable code.

Limitation:
I think undefined behavior doesn't manifest only as the compiler optimizing code away. Sometimes it is just a hidden assumption made when implementing the compiler; for example, the compiler chooses a particular order to evaluate function arguments. That kind of issue couldn't be found by this framework.

My comments:
1. Invoking undefined behavior is a manifestation of a gap in semantics/agreement between the programmer and the compiler. This kind of gap happens not only in language compilation, but also in file systems (e.g., Thanumalayan Sankaranarayana Pillai's recent work), in CPUs, and in all layers of systems where we have a stable API and people expect the lower layer to just work. This paper provides a framework to identify and analyze the gap in compilers; can we extend it to work for all layers? What can we borrow from their insights?
2. How hard is it to implement a compiler which doesn't exploit undefined behavior? Maybe I shall try it next semester in 706...




Friday, November 1, 2013

RETCON: Transactional Repair without Replay

This is a paper from the architecture community (ISCA'10), and it solves a transactional memory problem. But maybe it is useful in a journaling file system's context?

Inspired by selective replay [Srinivasan+'04, Sarangi+'05, ...]

Observation: accesses to the main data are concurrent, but the auxiliary data conflicts. Auxiliary data serializes parallel operations and thus significantly reduces performance. (Example: a hashtable's size counter, etc.)

Basic idea: use symbolic tracking so that if the aux data conflicts, we repair it, while the main data can still be updated concurrently.


Approach:
1. Use symbolic values for input/output (keep track of which variables we are storing in each register, and which variables we are writing from registers to memory).
2. At commit: if somebody else changed the aux data's value while we were inside our transaction, we just repair it by updating the aux data again, while the main data structure remains concurrently updated (see the sketch below).
3. Track constraints on the inputs: if no constraint is violated, then no alternative code path needs to be taken (no change in control flow), and steps 1 and 2 work just fine; but if the control flow changes, we need to abort the transaction and restart the execution.

How could this be applied to a file system?
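
A rough software analogy of that repair step (RETCON itself does this in hardware, at the register level): track the auxiliary update symbolically as a delta, and at commit re-apply the delta on top of whatever value the counter holds by then, instead of aborting.

class HashTable:
    def __init__(self):
        self.buckets, self.size = {}, 0      # main data + auxiliary size counter

class HashTableTx:
    """Toy transaction: inserts are tracked concretely, the size counter only symbolically."""
    def __init__(self, table):
        self.table = table
        self.pending = {}                    # key -> value written in this transaction
        self.size_delta = 0                  # symbolic: "size = size_at_commit + delta"

    def insert(self, key, value):
        self.pending[key] = value
        self.size_delta += 1                 # never read table.size here -- that would conflict

    def commit(self):
        # main data: keys touched by concurrent transactions are assumed disjoint here
        self.table.buckets.update(self.pending)
        # aux data: "repair" by applying the delta to the *current* size, even if another
        # transaction bumped it while we were running
        self.table.size += self.size_delta

ht = HashTable()
t1, t2 = HashTableTx(ht), HashTableTx(ht)
t1.insert("a", 1); t2.insert("b", 2)         # concurrent transactions touching different keys
t1.commit(); t2.commit()
print(ht.size, ht.buckets)                   # 2 {'a': 1, 'b': 2}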

Related work:
•ReSlice [Sarangi+’05]
        Maintains insns in dependent slice of conflicting operation in TLS
        To repair, re-executes these instructions
•Dependence-aware transactional memory [Ramadan+’08]
        Forwards speculative values to optimize ordered communication
        Unlike RetCon, can’t handle conflicts with cyclic communication…
        But OTOH, can handle arbitrarily complex computation
•Advanced TM interfaces & data structure implementations
        Open nesting [Moss+’05,Moravan+’06,Ni+’07]
        Transactional boosting [Herlihy+’08]
         Abstract nesting [Harris+’07]
         Lock-free hashtables [Click’07,…]
         Scalable non-zero indicators [Ellen+’07]

External systems research at Google

Bradley Chen
Principal Engineer, Google

He had an amazing paper at SOSP'93, which describes the impact of different OS structures on memory system performance; it really showed how you could do interesting, detailed measurements of systems.

Google refuses to put researchers in a lab; they are out in the field.

A Ph.D. teaches you a systematic way to approach creating new science, and to advance the state of the art.

Google's support for university research:
1. Faculty research program (for faculty)
2. Research fellow program (for Ph.Ds)
3. Own publications
4. Open source projects
5. Conference sponsorship

Some examples:
1. Fighting fraud on the internet (Stefan Savage, UCSD)
    Computer security is not just a technical problem; it is really about people (adversaries, victims, defenders). Thus socio-economic factors come in.
     E-mail spam: a complex value chain of relationships: email programs, DNS, web servers, goods/services shipping, financial transactions, etc. They measured all the parts and found that most resources are cheap and the business profitable, but very few merchant banks are involved (actually, 3). Result: targeted payment intervention. (Really cool!!!) Remzi said: don't they worry about the Russian guys showing up in their labs?!
     Q by Remzi: maybe it's a bigger problem if drugs are cheaper overseas? Then you cannot really stop the brokers.
      I need to read more about their methodology for doing this kind of measurement. They do collaborate with banks to track money flow, though.

2. Integrating Circuit Switching into the Datacenter (UCSD)
    Networks don't scale very well to datacenter scale, because with larger bandwidth a switch comes with fewer ports, so you need a deeper stack of switches to connect the same cluster.
     Key idea: hybrid circuit/packet networks. (Actually it's just an optical/packet switched network.)
     Mordia: circuit switching to the ToR --- a fast control plane (~100 ms down to ~100 us), and a fast OCS.
     Mordia network model: a traffic matrix TM(time). Advancing from the original hotspot scheduling, they make reconfiguration much faster by 1. faster hardware, 2. a faster way to observe the network using the matrix. The TM representing traffic demands ---> TM' representing the bandwidth allocation, then decompose TM' into a schedule (I didn't really understand how though....)
   Faster hardware: from a 3D mirror setting to a 2D mirror setting; reconfiguration time reduces to ~100 us.
   Remzi: Google has stopped saying that the network is their bottleneck in their papers (unlike earlier papers such as GFS); doesn't that mean Google has new ways to do networking?

3. Disciplined Approximate Computing.
    When you convert analog to digital, it is just an approximation. But you need to be aware of when approximation happens, instead of doing it recklessly.
    "Disciplined" approximate programming: decide when approximation can be used. They have language and architecture support, which allows you to specify where approximation may happen. A digital NPU: compute like a neuron.
     Results: 2x speedup, 3x energy reduction, and little quality loss.

Q: When Google funds a project, is it just based on merit, or is it also based on the other funding sources which might be available to the project?
A: Virtually all the projects Google likes have some other source of funding, like NSF.
Q: Does Google do a lot of lobbying for research?
A: Google is doing more and more of that, for better or for worse. Actually the UK and Switzerland are a lot more robust in terms of supporting research, but Russia and France are much worse than the US.