Sunday, March 3, 2013

CA-NFS: A Congestion-Aware Network File System

FAST'09 NetApp

Key Idea:
1. Make the NFS client behave better to improve overall distributed-system performance.
2. They define a congestion price in a device-independent way based on utilization (of disk, CPU, network, and even virtual devices such as readahead effectiveness, which serves as a heuristic). -- ref.: “Throughput-competitive online routing”, FOCS ʼ93. (A rough sketch follows below.)
3. The servers and clients then use the price to coordinate and schedule file system operations.
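
To make the pricing idea concrete, here is a rough Python sketch of exponential, utilization-based pricing (my own simplification, not the paper's exact formula; the base k and the plain sum over resources are assumptions):

```python
# Sketch only: each resource maps utilization u in [0, 1] to a price that
# grows exponentially, so a nearly saturated device dominates the total.
# The base k and the plain sum are my own choices, not taken from the paper.
def resource_price(utilization, k=64.0):
    """Price for one device (disk, CPU, network, readahead effectiveness, ...)."""
    return (k ** utilization - 1) / (k - 1)   # 0 when idle, 1 when saturated

def server_price(utilizations):
    """Aggregate price the server advertises back to its clients."""
    return sum(resource_price(u) for u in utilizations.values())

# Clients compare this price against their own to decide, e.g., whether to
# send asynchronous writes now or defer them.
price = server_price({"disk": 0.9, "cpu": 0.3, "net": 0.5, "readahead": 0.2})
```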

Networking Abstractions (in a cloud computing era)

Coflow: A Networking Abstraction for Cluster Applications
HotNets'2012, UC Berkeley (same group behind "Resilient Distributed Datasets")

Key Ideas/Takeaway:
1. The completion time of a cluster application depends more on the fate of a collection of flows than on individual flows.
2. Most application needs can be expressed in terms of minimizing completion time or meeting deadlines. (really?)
3. Flows can be decoupled in time (by using storage) and in space (by using broadcast or multicast). --- Note: this is an example of the network using storage!
4. Typical cluster-application dataflow patterns: MapReduce, dataflow with barriers (multi-stage MapReduce, e.g., Pig), dataflow without explicit barriers (Dryad), dataflow with cycles (Spark), bulk synchronous parallel (parallel scientific computing), partition-aggregate (e.g., the Google search engine).

API:
    four players: driver (cluster coordinator, or cloud controller), sender, receiver, network
    create(pattern, [options]) => coflow handle, called by the driver; pattern may be shuffle, broadcast, aggregation, etc.
    update(handle, [options]) => result, called by driver
    put(handle, flow id, content, [options]) => result, called by sender
    get(handle, flow id, [options]) => content, called by receiver
    terminate(handle,[options]) => result, called by driver
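
A toy Python mock of the call flow above, just to make the roles concrete (the CoflowAPI class and its internals are my own invention; in the real proposal the coflow lives in the network, not in a dictionary):

```python
# Toy mock of the API listed above; not the paper's implementation.
class CoflowAPI:
    def __init__(self):
        self._coflows, self._data, self._next = {}, {}, 0

    def create(self, pattern, **options):                  # called by the driver
        handle, self._next = self._next, self._next + 1
        self._coflows[handle] = {"pattern": pattern, **options}
        return handle

    def update(self, handle, **options):                   # called by the driver
        self._coflows[handle].update(options)

    def put(self, handle, flow_id, content, **options):    # called by a sender
        self._data[(handle, flow_id)] = content

    def get(self, handle, flow_id, **options):             # called by a receiver
        return self._data[(handle, flow_id)]

    def terminate(self, handle, **options):                # called by the driver
        self._coflows.pop(handle, None)

# Driver creates a shuffle coflow; mappers put, reducers get, driver terminates.
net = CoflowAPI()
h = net.create("shuffle", deadline_ms=500)
net.put(h, ("map0", "reduce1"), b"partition bytes")
assert net.get(h, ("map0", "reduce1")) == b"partition bytes"
net.terminate(h)
```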
 
Underlying Assumption:
    coflow assumes a fixed set of senders and receivers; the driver has to determine them without network participation, i.e., they exclude the possibility that the network determines where to put replicas, etc. This might be mitigated by using candidate senders/receivers, I am not quite sure.
   "coflow comes into action once you have already determined where your end-points are located. If the end-point placement decision is not a good one, there is only limited opportunity"

Questions:
1. How does this work in a virtualized environment?
2. How does the cloud controller coordinate multiple coflows? They propose sharing (reservation-based), prioritization, and ordering. This gives the cloud controller a way to allocate the (abstracted) network resources? What is the implication?
3. How does the network coordinate requests from multiple cloud controllers? Not explicit in the paper.
4. What network topology is presented to the cloud controller? The real topology? Rings?
5. How does this framework handle the situation where the network is not the bottleneck? Does it have to dynamically interact with the computation and storage units?




Programming Your Network at Run-time for Big Data Applications
HotSDN'2012 IBM Watson, Rice University

Key ideas:
1. The application manager sends a traffic demand matrix to the SDN controller, which in turn uses this information to optimize the network (e.g., using optical switches to set up the topology). This traffic demand matrix is estimated using application-level knowledge. (A rough sketch follows after this list.)
2. The application manager, knowing that it is operating on an optical-switch-enabled network, can do some simple optimizations, e.g., aggregate reducers in the same rack, or submit requests to the SDN controller in batches. Here the application manager uses two pieces of network information: that the network uses optical switches, and which nodes are in the same rack.
3. Using the traffic matrix (?), they argue for efficient implementations on optical switches of some particular communication patterns, e.g., aggregation, shuffling, or overlapped aggregation.
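
A rough Python sketch of idea 1 (the function names estimate_demand and place_circuits are hypothetical, and the circuit-placement heuristic is my guess, not the paper's algorithm):

```python
# Hedged sketch: the application manager estimates a rack-to-rack traffic
# demand matrix from job-level knowledge and hands it to the SDN controller,
# which decides where to place optical circuits.
def estimate_demand(map_outputs, reducer_rack):
    """Aggregate per-task shuffle sizes into a rack-to-rack demand matrix."""
    demand = {}   # (src_rack, dst_rack) -> bytes
    for (src_rack, size) in map_outputs:
        key = (src_rack, reducer_rack)
        demand[key] = demand.get(key, 0) + size
    return demand

def place_circuits(demand, num_circuits):
    """Controller heuristic: give optical circuits to the heaviest rack pairs."""
    return sorted(demand, key=demand.get, reverse=True)[:num_circuits]

demand = estimate_demand([("rack1", 8e9), ("rack2", 3e9), ("rack1", 2e9)], "rack3")
circuits = place_circuits(demand, num_circuits=1)   # -> [("rack1", "rack3")]
```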

Comments:
1. This work is at rack granularity, because only the ToR switches have optical links.
2. Another paper mixed API and implementations.......=.=
3. I am really sick of Hadoop...-__-b!!!

Questions:
1. What is the network topology presented? A ring per rack?




Fabric: A Retrospective on Evolving SDN
HotSDN'2012, Nicira and UC Berkeley, Scott Shenker

This paper advocates a router/switch-chassis-style implementation of the network.

Before the paper:
1. I love Scott Shenker!!!! I almost agree with every word he says about SDN!
2. We definitely need some layer-2.5 addressing, since both IP and MAC have fundamental deficiencies.

What's the API of network:
1.  Host --- Network interface
     Hosts ask the network to send their packets, along with QoS requirements.
     Currently this is done via the packet header.

2. Operator -- Network interface
     Operators give requirements and decisions about network operation to the network.
     Currently this is done by box-by-box router configuration. SDN provides a more programmable interface for this, by decoupling the distribution model of the control plane from the topology of the data plane.

3. Packet -- Switch interface
     How a packet identifies itself to a switch. The switch then uses this piece of information to do forwarding, thus actually implementing the connectivity of the network.
     Currently this is done via the packet header.

  The problem is that we currently don't distinguish the Host-Network interface from the Packet-Switch interface, which unnecessarily couples the implementation of network services (isolation, security, etc.) with the implementation of core connectivity.

The fabric architecture:
1. hosts, which ask for network services
2. edge switches, which implement network services, using current headers and protocols (e.g., IPv4)
3. core fabric, which implements network connectivity, potentially using its own labels (like MPLS)

This is very much like the internal architecture of a modern chassis switch.
(Two versions of) SDN should be introduced separately: to the edge for service management (complex), and to the core for connectivity management (very basic).
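
A toy Python illustration of the edge/core split (entirely my own; the label values and helper functions are made up):

```python
# Edge switches speak the host's protocol (e.g., IPv4), apply services, and
# push a fabric label; core switches forward purely on that label, MPLS-style.
FABRIC_LABELS = {"10.0.2.0/24": 7, "10.0.3.0/24": 9}   # set up by the edge controller

def allowed(packet):
    return True    # placeholder for edge-only services (ACLs, isolation, ...)

def lookup_prefix(dst_ip):
    return "10.0.2.0/24" if dst_ip.startswith("10.0.2.") else "10.0.3.0/24"

def edge_ingress(packet):
    """Edge: enforce service policy on the host-facing header, then push a label."""
    if not allowed(packet):
        return None
    label = FABRIC_LABELS[lookup_prefix(packet["dst_ip"])]
    return {"label": label, "payload": packet}

def core_forward(labeled_packet, label_table):
    """Core: look at nothing but the label -- pure connectivity."""
    return label_table[labeled_packet["label"]]          # -> output port

pkt = edge_ingress({"dst_ip": "10.0.2.15", "data": b"..."})
port = core_forward(pkt, {7: "port-3", 9: "port-5"})     # "port-3"
```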


Mesos: Fine-Grained Resource Sharing in the Data Center
NSDI'2011, UC Berkeley, Scott Shenker and Ion Stoica

Key Idea: Resource Offers
Instead of doing the scheduling itself, Mesos makes resource offers and pushes scheduling decisions to the framework applications. E.g., Mesos offers two nodes with 8 GB RAM each; Hadoop decides whether to take this offer and which task to launch on it.
A more traditional approach would be for applications to express their needs in a (specially designed) language and have a central scheduler schedule based on those needs. But what if an application has needs that can't be expressed in such a language? Also, Hadoop already has scheduling logic, so why not utilize it?
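
A minimal Python mock of the offer flow (class and method names are hypothetical; real frameworks register scheduler callbacks with the Mesos master):

```python
# Mock of the offer flow described above: Mesos offers resources, and the
# framework's own scheduler decides whether to accept and what to launch.
class HadoopScheduler:
    def __init__(self, pending_tasks):
        self.pending = pending_tasks          # [(task_id, mem_gb_needed), ...]

    def resource_offer(self, node, mem_gb):
        """Framework-side decision: accept the offer only if a task fits
        (a real scheduler would also prefer tasks whose input lives on node)."""
        for i, (task_id, need) in enumerate(self.pending):
            if need <= mem_gb:
                return self.pending.pop(i)    # launch this task on the offer
        return None                           # reject; Mesos re-offers elsewhere

scheduler = HadoopScheduler([("t1", 12), ("t2", 6)])
print(scheduler.resource_offer("node-a", mem_gb=8))   # -> ("t2", 6)
```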



The Datacenter Needs an Operating System
HotOS'2011, UC Berkeley, Scott Shenker and Ion Stoica

Think of the datacenter as the new computer, and think about the datacenter infrastructure problem from an OS perspective.

A datacenter OS needs to provide:
1. Resource sharing
    Hadoop already does scheduling between jobs.
    Unsolved: inter-framework sharing, sharing the network, independent services, and virtualization

2. Data sharing
    Currently done in the form of distributed file system
    Unsolved: standardized interfaces (like VFS?), performance isolation, etc.

3. Program abstractions
    including communication primitives

4. Debugging and Monitoring

Questions:
If we think of Hadoop as a form of data center OS, where does it fall short?



Location, Location, Location! Modeling Data Proximity in the Cloud
HotNets'2010, MSR and U Mich

Key Idea:
Insert a layer (which they call Contour) between the application and the key-value store, which reports to the application the latency of accessing a particular key.
To calculate this, the key-value store reports to Contour a replication topology for each key; Contour combines this information with network latency, etc., to calculate update latency.
It suffers from a security problem, since it reveals too many details about the storage layer to the application.
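
My guess at the shape of the per-key computation, as a Python sketch (the paper's actual latency model is richer; the replica roles and the rtt table here are made up):

```python
# Given the replication topology for a key and measured node-to-node latencies,
# report read and update latency for that key to the application.
def read_latency(client, replicas, rtt):
    """Read from the closest replica."""
    return min(rtt[(client, r)] for r in replicas)

def update_latency(client, primary, replicas, rtt):
    """Write to the primary, then wait for it to reach all other replicas."""
    fan_out = max(rtt[(primary, r)] for r in replicas if r != primary)
    return rtt[(client, primary)] + fan_out

rtt = {("app", "A"): 2, ("app", "B"): 40, ("A", "B"): 35}
print(read_latency("app", ["A", "B"], rtt))            # 2
print(update_latency("app", "A", ["A", "B"], rtt))     # 2 + 35 = 37
```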



Thursday, February 21, 2013

Higgs might spell doom for the entire universe?



There seems to be some media heat around the claim that our universe could be swallowed by an alternate one, based on calculations using the current Higgs mass.

“At some point, billions of years from now, it’s all going to be wiped out…. The universe wants to be in a different state, so eventually to realise that, a little bubble of what you might think of as an alternate universe will appear somewhere, and it will spread out and destroy us,” Lykken said at AAAS.

This is based on a renormalization group calculation extrapolating the Higgs effective potential to its value at energies many many orders of magnitude above LHC energies. To believe the result you have to believe that there is no new physics and we completely understand everything exactly up to scales like the GUT or Planck scale. Fan of the SM that I am, that’s too much for even me to swallow as plausible.

Commented by Peter Woit 

Friday, February 8, 2013

HARDFS (selective 2-versioning on HDFS)

Common solutions to deal with fail-silent failures:

1. using replicated state machines
2. N-version programming


Main idea of HARDFS (selective 2-versioning):

0. Better crash than lie! So keep watching, and whenever somebody is doing something wrong, either recover or just kill them.
0. Make use of the fact that the system is already robust and able to recover from a lot of failures (e.g., crashes).
1. selective (only replicate important state)
2. use Bloom filters to compactly encode state (i.e., all file system state is encoded in terms of yes-or-no questions) -- they use a particular kind of Bloom filter which supports deletion (a toy version is sketched after this list)
3. ask-then-check for non-boolean verification???
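
A minimal counting Bloom filter in Python, as a stand-in for the deletion-supporting filter they use (the hash scheme and sizes are my own choices):

```python
# Counting Bloom filter: each slot holds a counter instead of a bit, so
# deletion is possible, unlike a plain Bloom filter.
import hashlib

class CountingBloomFilter:
    def __init__(self, num_counters=1 << 16, num_hashes=4):
        self.counters = [0] * num_counters
        self.k = num_hashes

    def _slots(self, item):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % len(self.counters)

    def add(self, item):
        for s in self._slots(item):
            self.counters[s] += 1

    def remove(self, item):
        for s in self._slots(item):
            self.counters[s] -= 1

    def __contains__(self, item):
        return all(self.counters[s] > 0 for s in self._slots(item))

# File system state encoded as yes/no questions, e.g. "does /a have child b?"
f = CountingBloomFilter()
f.add("/a has child b")
assert "/a has child b" in f     # second version verifies the main version
f.remove("/a has child b")       # deletion works, unlike a plain Bloom filter
```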

Evaluation of HARDFS:
1. detects bit-flip errors pretty well; more crashes because of more bookkeeping (better crash than lie!)
     (how well it does on more realistic/correlated errors is still unknown -- but he did do experiments showing it protects against bugs from Mozilla bug reports?)
2. 3% space overhead, 8% performance overhead, 12% additional code -- because only a part of the state is replicated and only a part of the code is 2-versioned.




More details on HARDFS:


selected parts:
harden namespace management,
harden replica management,
harden the read/write protocol

micro-recovery



the second version watches input/output

node behavior model

state manager:

the state manager replicates a subset of the state (needs to understand HDFS semantics)

use Bloom filters because they do boolean verification well

to update Bloom filter state: ask the main version for values and check them against the 2nd version's Bloom filter
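
My reading of ask-then-check, as a Python sketch (the main-version interface get_locations is hypothetical; the filter is assumed to support membership tests like the one sketched earlier):

```python
# Sketch: the watcher cannot enumerate values out of a Bloom filter, so it asks
# the main version for the value and then checks that (key, value) against its
# own filter; disagreement triggers recovery (or a crash -- better crash than lie).
def verify_block_locations(main_version, shadow_filter, block_id):
    claimed_nodes = main_version.get_locations(block_id)        # ask
    for node in claimed_nodes:                                   # then check
        if f"block {block_id} on {node}" not in shadow_filter:
            raise RuntimeError("state disagreement: recover or crash")
    return claimed_nodes
```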

actively modified state is kept in concrete form to enable in-place updates -- to avoid CPU overhead

a false positive in the Bloom filter only results in unnecessary recovery (as long as faults are transient and non-deterministic)


action verifier:

four types of wrong actions:
        corrupt
        missing
        orphan
        out of order

Handling disagreement:

        using domain knowledge to ignore false alarms
        for true alarms, I think they just restart the system using on-disk state and other nodes' states
        
Recovery:
        1. crash and reboot (expensive)
        2. micro-recovery (pin-point the corrupted state by comparing the states of the two versions, then reconstruct only the corrupted state from disk)
                however, this needs to remove the corrupted state from the Bloom filters
                solution: start over with a new Bloom filter, and add all correct states to that new filter
               
        3. thwarting destructive instructions
               ???? the master just tells the node?????



       
               

Tuesday, February 5, 2013

Linux File System Evolution

pages.cs.wisc.edu/~ll/fs-patch


study Linux bug reports/patches to understand:
1. what do patches do?
2. what types of bugs are there?
3. what techniques are used to improve performance/reliability?

patches = 40% bug fixes + 9% reliability/performance improvements + 10% features + other maintenance

bug fixes = 60% semantic bugs + 10% concurrency bugs + 10% error-code bugs + 10% memory bugs

a stable file system doesn't mean bugs become fewer (feature additions, etc.)

bug consequences = 40% corruption + 20% crash + 10% error + 10% wrong behavior + others (leak, hang, etc.)
as opposed to application (user-space) bugs, where wrong behavior dominates

transaction code may introduce a large number of bugs; the tree code is not really error-prone

40% of bugs happen on failure paths (people don't handle failures well)

performance improvements = 25% sync optimization + 25% access optimization + 25% scheduling optimization


Monday, February 4, 2013

GPUfs: GPU integrated file system

Mark Silberstein from UT-Austin

argue for more suitable abstractions for accelerators

accelerators != co-processors (they shouldn't be handled as such at the software layer) -- even though in hardware the GPU is a co-processor -- it cannot open files by itself (as it cannot interrupt into the host hardware)

thus On-accelerator OS support + Accelerator applications

In the current model, the GPU is a co-processor, and you have to manually do double-buffering, pipelining, etc., i.e., too many low-level details are exposed (9 CPU LoC for 1 GPU LoC).

the GPU has three levels of parallelism:

1. multiple cores on the GPU

2. multiple contexts on a core (to hide the latency of memory access)

3. SIMD vector parallelism, i.e., multiple ALUs via data parallelism

the GPU can access its local memory about 20x faster (200 GB/s vs. 16 GB/s) than it can access CPU memory; consistency is also compromised

file system API design:

1. disallow threads in the same SIMD group from opening different files (all of them collaboratively execute the same API call) --- to avoid the divergence problem

2. gopen() is cached on the GPU (same file descriptor for the same file, i.e., the offset is shared, thus reads/writes have to specify the offset explicitly): gread()=pread(), gwrite()=pwrite()

3. When to sync? It can't be done asynchronously because the GPU can't have preemptive threads, and CPU polling is too inefficient, so they require explicit sync (otherwise data never gets to disk!)
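
A host-side Python mock of those semantics, not GPU code (the gopen/gread/gwrite names follow the talk; everything else, including gfsync, is my guess at the behavior):

```python
# Mock of the semantics only: descriptors are cached and shared (no per-thread
# offset, so reads/writes take explicit offsets), and writes stay buffered
# until an explicit sync, mirroring "otherwise data never gets to disk".
_open_cache = {}   # path -> shared descriptor (same descriptor for the same file)
_dirty = {}        # path -> {offset: bytes}, flushed only by gfsync()

def gopen(path):
    return _open_cache.setdefault(path, {"path": path})

def gread(fd, offset, size):                  # pread()-style: explicit offset
    with open(fd["path"], "rb") as f:
        f.seek(offset)
        return f.read(size)

def gwrite(fd, offset, data):                 # pwrite()-style: explicit offset
    _dirty.setdefault(fd["path"], {})[offset] = data   # stays in "GPU memory"

def gfsync(fd):                               # explicit sync: only now touch disk
    with open(fd["path"], "r+b") as f:
        for offset, data in sorted(_dirty.pop(fd["path"], {}).items()):
            f.seek(offset)
            f.write(data)
```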


GPUfs design:

1. system-wide buffer cache: AFS close(sync)-to-open consistency semantics (but per block instead of per file)

2. buffer page false sharing (when the GPU and CPU write to different offsets of the same page)

3. an RPC system based on a client (GPU) / server (CPU) architecture

4. L1 bypassing is needed to make the whole thing work



Q&A:

1. Why a file system abstraction instead of a shared-memory abstraction? You don't need durability on the GPU anyway.
     With weak consistency on memory it is hard to program, but with weak consistency on a file system it's no big deal.