Monday, November 3, 2014

Wrangler: Predictable and Faster Jobs using Fewer Resources

From UC-Berkeley

Solution for stragglers:
  1. speculative execution, but it wastes resources and/or time
Design spaces:
  1. LATE (OSDI'08)
  2. Mantri (OSDI'10)
  3. Dolly (NSDI'13)

Design Principles:
Identify stragglers as early as possible (to avoid wasted resources)
Schedule tasks for improved job finish time (to avoid wasted resources and time)

Architecture of Wrangler:
Master: model builder, predictive scheduler
Slaves: workers

Selecting "input features": memory, disk, run-time contention, faulty hardware
Using feature selection methods: features of importance vary across nodes and across time.
Why: complex task-to-node interaction and task-to-task interaction, heterogeneous clusters and task requirements
Approach: use classification techniques to build the model automatically; they use an SVM.


Evaluation
~80% true positive and true negative rate
Question: Is this accuracy good enough?
How to Answer: improved job completion time? Reduced resource consumption? ---Key is Better load-balancing.
Initial evaluation:  no better load-balancing
Second Iteration: Use confidence measure
Final Evaluation: Reduced job completion time and reduced resource consumption.
Insight: confidence is key!
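The confidence idea can be sketched in a few lines. This is a toy illustration, not Wrangler's actual scheduler; `pick_node` and `predict` are invented names, with `predict` standing in for the per-node SVM model:

```python
def pick_node(nodes, predict, conf_threshold=0.7):
    """Assign a task to the first node NOT confidently predicted to straggle.

    predict(node) -> (will_straggle, confidence) is a stand-in for the
    per-node model. Low-confidence straggler predictions are ignored,
    mirroring the second-iteration fix: acting on every prediction
    hurt load balancing.
    """
    for node in nodes:
        will_straggle, confidence = predict(node)
        if not (will_straggle and confidence >= conf_threshold):
            return node
    return nodes[0]  # every node looks risky: fall back instead of starving
```

Delaying assignment only on *confident* straggler predictions is what keeps utilization up while still avoiding the bad nodes.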

Another question: Sophisticated schedulers exist. Why Wrangler?
  1. Difficult to anticipate the dynamically changing causes
  2. Difficult to build a generic and unbiased scheduler

Q&A:
Q: How do you differentiate stragglers caused by a poor environment from nodes that actually have more work to do?
A: It is not addressed in this work; we will look into it.
Q: How does Wrangler compare to existing techniques such as LATE and Dolly?
A: I don't have numbers for that. But we provide a mechanism (?) which sits on top of everything else.
Q: How much time do you need to train the model?

A: We keep collecting data (in a somewhat online fashion)

SOCC'14 Session 1: High Performance Data Center Operating Systems and Networks

Arrakis: An OS for the Data Center

Systems in the data center are generally I/O bound.
Today's I/O devices are fast (NICs, RAID controllers, etc.), but the OS cannot keep up with them.

Kernel: API, Naming, ACL, Protection, I/O scheduling, etc.: too heavyweight

Arrakis: Skip kernel and deliver I/O directly to applications, but keep classical server OS features.

Hardware can help, because more and more functionality is embedded in hardware (SR-IOV, IOMMU, packet filters, logical disks, NIC rate limiters, etc.)

Approach: put protection, multiplexing, and I/O scheduling in the device; put the API and I/O scheduling in the application; keep naming, ACLs, and resource limiting in the kernel, since they are not on the data path. So device + application form the data plane, and the kernel is the control plane.

Kernel: does ACL control once when configuring the data plane; a virtual file system for naming
Redis (application): persistent data structures (log, queue etc.)

Results: in-memory GET latency reduced by 65%, PUT latency by 75%, 1.75x GET throughput, etc.

Implication: we are all OS developers now.
I/O hardware-application co-design
Application needs fine-grained control (aka openflow): where in memory do packets go, how to route packets through cores, etc.
Application-specific storage design

Question:
Q: How does it compare with a hacked Linux kernel?
A: No specific answer. Some people have worked on "hacked Linux kernels", e.g., user-level networking, or Remzi's work (?)
Q: Limitations? In particular, binding for large-scale applications?
A: Limitations on hardware. E.g., you can't have more than a few virtual disks on a real disk, but you can do hundreds for network devices (?)



Network Subways and Rewiring:

Today's datacenter tension: cost vs. capacity; above the ToR switches, average link utilization is only 25%

Why: rack-level traffic is bursty/long tailed

Subways: multiple ports per server
So, what do we do with the extra links?
Today:  wire to multiple core switches
Proposal: connect to the neighboring ToR; less ToR traffic, and load is distributed more evenly

Result: memcached up to 2.8x performance improvement

Question:
Q: Wiring across racks could concern people (datacenter administrators).
A: We haven't talked with those people, but there is a huge performance benefit.
Q: How does this change failure modes?
A: Large-scale failure modes we don't know about. But we can do faster local recovery, etc.
Q: Power usage?
Q: Competing jobs and your rewiring?
A: We have more flexibility


Saturday, June 7, 2014

Generalized Filesystem Dependencies

C Frost et al., SOSP 2007

Some Background:
Three rules to ensure metadata consistency:
1. Never write a pointer before initializing the structure it points to.
2. Never reuse a resource before nullifying all pointers to it.
3. Never clear the last pointer to a live resource before setting a new one.

Soft-updates:
Keep track of dependencies between blocks (B->A: block A must be written before B).
But we also need to keep undo information for changes: block granularity causes false sharing and cycles between blocks, so undo changes are used to break block-level dependency cycles.

Key Idea:
Similar to soft-update, but implemented in a file system independent way.
The key idea is a new division of labor: the file system explicitly manipulates consistency dependency information, while the kernel's write-back mechanism is solely responsible for flushing blocks to disk while respecting dependencies.
Dependencies are expressed between "patches", where a patch is a single change to one block.
Rules to enforce (where C is the set of committed patches and F_B the in-flight patches for block B):
dep[C] ⊆ C
dep[F_B] ⊆ C ∪ F_B
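The division of labor can be modeled in a few lines. This is a toy sketch with invented names (`Patch`, `flushable_blocks`), not Featherstitch's implementation: the file system only records dependencies, and the write-back side computes which blocks can go to disk without violating the rules; undo data and cycle-breaking are omitted.

```python
class Patch:
    """A single change to one block, plus the patches it depends on
    (every patch in deps must reach disk before this one)."""
    def __init__(self, block, deps=()):
        self.block, self.deps = block, set(deps)

def flushable_blocks(patches, committed):
    """Blocks whose every uncommitted patch depends only on committed
    patches or on patches headed to the same block -- i.e., flushing the
    block keeps dep[F_B] inside C union F_B."""
    by_block = {}
    for p in patches:
        if p not in committed:
            by_block.setdefault(p.block, []).append(p)
    ok = []
    for block, ps in by_block.items():
        same = set(ps)
        if all(d in committed or d in same for p in ps for d in p.deps):
            ok.append(block)
    return ok
```

With a patch B depending on a patch A in another block, only A's block is flushable at first; once A is committed, B's block becomes flushable.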

Benefit and Potential: 
This separation of consistency manipulation and implementation would allow different parties to easily cooperate to enforce consistency. Virtualization seems like an obvious fit. In the paper they talk about loopback device consistency.

Application-level consistency is also promising. They propose an application-level "patch group", a mechanism to make one set of file changes depend on another set, in addition to the dependencies the file system enforces. This is implemented by inserting two empty patches at the start and end of each patch group, and making one group's start patch depend on the other group's end patch.

One can also imagine that, without the file system dictating when to write blocks to disk, the block-level scheduler now has more freedom to optimize. This would completely separate the decision of where to write a block (by the file system) from when to write it (by the kernel block scheduler). It is interestingly related to split-level scheduling, where when to write is based on file system information, and to delayed allocation, where we try to combine the decisions of where and when to write.

Criticism/Possible improvement:
1. In order to track the committed set, they require knowledge of when each block actually hits the disk surface, which is generally not available... They do this using a combination of NCQ and FUA support offered by SATA, but it is still expensive. I am not quite sure how to do better, though. Maybe track a set of blocks committed?

2. Every change to every block seems like too much to keep track of... Of course they do implementation optimizations. But it seems like a simpler abstraction, say at the block level, or one that introduces atomicity, would be more useful. If it is at the block level, then it is more like soft updates, which is again hard to implement. So maybe one should stare more at the consistency requirements and come up with a better abstraction, one that is both for the file system to manipulate and for the buffer/block layer to implement.

3. Even though they implemented soft updates, journaling, and SVN application consistency, they didn't use this mechanism to enable more interesting things: virtualization, different consistency guarantees for different clients, file system cooperation, scheduler optimization, etc.

Related Work:
1. Soft updates do things at the block level, but in a file system dependent way.
2. CAPFS and Echo considered customizable application-level consistency protocols in the context of distributed, parallel file systems. Echo maintains a partial order on the locally cached updates to the remote file system.
3. Burnett's thesis describes a system tracking dependencies among system calls, associate dirty blocks with unique IDs returned by those calls, and duplicate dirty blocks when necessary to preserve ordering.
4. Xsyncfs's external synchrony provides users with the same consistency guarantees as synchronous writes, but is implemented by committing blocks in groups using a journaling design.

Other resources:
Note of this paper:
http://www.scs.stanford.edu/13wi-cs240/notes/featherstitch.txt

Monday, May 12, 2014

We're In Trouble, Computers Are Not Helping

James Mickens, Microsoft Research, Distributed System Group

MapReduce: boring...zzzzzzzz

Cloud is dirty underneath...

Unmodified POSIX/Win32 applications running with cloud scale I/O performance

App
--------------------------
Blizzard virtual drive
---------------------------
Disks

1. Delayed durability semantics: flush is now only an ordering semantic, not a durability semantic anymore
2. How to avoid buffering a lot of data: treat the backing storage as a distributed log, and only checkpoint up to the durability point

Now:
All previous writes are durable
All subsequent writes are not durable (which is nice...)
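A heavily simplified sketch of the flush-as-ordering idea; all names are invented, and the real Blizzard works at the virtual-drive level against a distributed log rather than an in-memory list:

```python
class ToyLog:
    """Flush closes an 'epoch' and returns immediately (ordering only);
    the backend later acknowledges whole epochs in order, advancing the
    durability point. Everything at or before the durable epoch is
    durable; everything after it is not."""
    def __init__(self):
        self.log = []            # (epoch, data) entries
        self.epoch = 0           # current open epoch
        self.durable_epoch = -1  # all epochs <= this are on disk

    def write(self, data):
        self.log.append((self.epoch, data))

    def flush(self):
        # Ordering barrier only: later writes land in a later epoch,
        # but we do not wait for the disk.
        self.epoch += 1

    def backend_ack(self):
        # Backing store reports one more epoch fully durable, in order.
        self.durable_epoch += 1

    def durable_writes(self):
        return [d for e, d in self.log if e <= self.durable_epoch]
```

This is why "all previous writes are durable, all subsequent writes are not" falls out naturally: durability always advances along epoch boundaries created by flushes.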

But still, it might be impossible to have the cloud work

Security: the government is more powerful than you give it credit for. Using cryptography is kind of like using a gun to defend against the government...

Cloud Security:
Even if you encrypt all your data, the cloud provider can still monitor your computation (searches and fetches, etc.) and derive correlations.

Friday, April 25, 2014

Why Are There So Few Female Masters?

Women have had the right to vote for a little over 100 years.
Women have worked outside the home for a little over 100 years.

Chinese women have had broad access to education for just over 60 years.
Chinese patriarchal society has more than 4,000 years of history.
Within those 60 years, experts have warned of a "boy crisis," called for highly educated women to return to the family, and hyped the "leftover women" narrative.

Yes, from antiquity to the present, master-level women have rarely appeared in art, literature, or any other field. Because they were not regarded as people at all. They were mere appendages: insulted, injured, trampled underfoot, and traded arbitrarily as commodities. Even a wife could be freely resold by her husband.

Christianity once solemnly debated whether women have souls; Buddhism held that women are impure, unqualified to enter paradise, and must be reincarnated as men to attain salvation.

In recent years women's weight in society has indeed grown considerably, but with it has come the male reaction: all manner of written and unwritten measures to constrain women's status and control their development.

Women have truly been "people" for only a hundred years. A hundred years of women's rise against 5,000 years of patriarchy: they have not yet accumulated enough strength, so top-tier female masters are extremely rare. This is reality, and it is history.

Thursday, April 10, 2014

How to Obfuscate Software

Sanjam Garg   IBM Watson

make computer programs "unintelligible" without affecting their functionality

What are we trying to hide? (an example)
What kinds of secrets does obfuscation try to hide?
1. try to hide a private key in a program
2. patch software securely: we don't want hackers to use patches to reverse-engineer what the bug was
    --use an obfuscated patch

Methods:
1. manual obscurity through programmer effort (not that difficult to decipher...)
2. crypto obfuscation (just changing function names, adding dummy variables, etc.) -- doesn't hide the logic inside the program
All ad hoc. They make the hacker's mission hard, but not impossible.

f[password, m](x) = m    if x = password
                    null otherwise
if you know the password, you should know what m is
if you don't know the password, m should also be hidden from you
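For this particular point function there is a classic hash-based construction that achieves exactly those two properties; a sketch below, where the salted-SHA-256 pad derivation is my own ad-hoc choice. Note this trick only works for password-like point functions; obfuscating general programs needs the multilinear-map machinery described later in the talk.

```python
import hashlib
import secrets

def obfuscate(password, m):
    """Return an 'obfuscated' version of f[password, m]: the closure
    captures only a salted hash of the password and m XORed with a
    password-derived pad, so neither password nor m appears in it.
    (m is assumed to be at most 32 bytes here, one SHA-256 pad.)"""
    salt = secrets.token_bytes(16)
    tag = hashlib.sha256(salt + password.encode()).digest()
    pad = hashlib.sha256(b"pad" + salt + password.encode()).digest()
    hidden = bytes(a ^ b for a, b in zip(m.encode(), pad))

    def f(x):
        # Wrong password: the hash check fails and nothing is revealed.
        if hashlib.sha256(salt + x.encode()).digest() != tag:
            return None
        # Right password: re-derive the pad and unmask m.
        xpad = hashlib.sha256(b"pad" + salt + x.encode()).digest()
        return bytes(a ^ b for a, b in zip(hidden, xpad)).decode()

    return f
```

Knowing the password lets you recover m; without it, inverting SHA-256 stands between you and both secrets.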

New mathematical tools:
Multilinear Maps

Solved a few open problems:
1. Non-interactive Key Agreement
    from two party setting (1976) to a three party setting (2000)
    now for arbitrary number of parties (2013)
2. Functional Encryption for all Circuits
    Public/private key, but now you want Google to check spam (give a specialized SK' to Google instead of SK) --- Alice could just give an obfuscated spam-checker program to Google
     Prior Work: very limited functions for which secret keys could be issued
     Now: All functions!
3. Witness Encryption
     encrypt to an arbitrary recipient: whoever has a proof (of an NP statement) can decrypt
     e.g.: the twin prime conjecture; $1M for the solver is put in a safe box
             whoever has a proof of the twin prime conjecture should be able to use it to open that safe box
     Prior: don't know if possible or not
     Now: Solved!
And many others...(Remove random oracles,  properties of zero-knowledge)


How do we do it?
high level idea:
1. for every input x, there is a unique set of puzzle pieces that reveals the output on input x;
    every other combination reveals nothing
    two steps: program -> unencrypted pieces -> encrypted pieces

e.g. a special function (witness encryption)
       Let G = (S, X) be a bipartite graph, b ∈ {0, 1}
       S = {A,B,C,D,E,F}
       X = {1, 2, 3, 4, 5, 6, 7}
f[G,b](T) = b    if T is an exact cover
            0    otherwise
want: hide the bit b
Exact cover is NP-complete, obfuscator doesn't know the exact cover
security: if G has no exact cover --> b is hidden (a bit simpler/weaker, but still not trivial, since when you write the program you don't know whether G has an exact cover or not)

Step 2: (their new encryption scheme)
an n-level encryption scheme for Z_p, for a prime p
have Enc1, Enc2, Enc3...Encn
this (new) encryption scheme has the following property:
hiding: Enci(x) hides x
multiplication: Enci(x)* Encj(y) = Enc(i+j)(x*y)
equality: given Encn(x) and Encn(y), one can check whether x = y

Step 1: Jigsaw Puzzle
for each vertex i in X, pick a random number ai
7-level Encryption scheme (based on what A, B, C, D...connect to)
  eA = Enc3(a1a4a7)
  eB = Enc2(a1a4)
  eC = Enc3(a4a5a7)
  eD = Enc3(a3a5a6)
  eE = Enc4(a2a3a6a7)
  eF = Enc2(a2a7)

e = Enc7(a1a2a3a4a5a6a7) if b = 0
      Enc7(r) for a random r, if b = 1

given the three pieces eB, eD, eF (which form an exact cover)
eB*eD*eF = Enc2(a1a4)*Enc3(a3a5a6)*Enc2(a2a7) = Enc7(a1a2a3a4a5a6a7)
recovered!

what if G has no exact cover? (remove the edge B->a4 in G)
now eB changes to Enc1(a1),
and you can no longer recover e by multiplication.
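The combination rule can be demonstrated with a toy stand-in where Enc_i(x) is simply the pair (i, x mod p). This has none of the hiding property (providing that is the whole point of the multilinear maps), but it shows why the exact cover {B, D, F} reassembles e:

```python
import random

# Toy stand-in: Enc_i(x) = (level i, x mod P). The real encodings hide x;
# here we only illustrate Enc_i(x) * Enc_j(y) = Enc_{i+j}(x*y).
P = 1000003  # a small prime; the real scheme uses a large p

def enc(level, x):
    return (level, x % P)

def mul(u, v):
    return (u[0] + v[0], u[1] * v[1] % P)

# Each vertex 1..7 gets a random a_i; a set S in the graph is encoded as
# the product of the a_i it covers, at level |S| (like eA..eF above).
a = {i: random.randrange(1, P) for i in range(1, 8)}

def encode_set(cover):
    prod = 1
    for i in cover:
        prod = prod * a[i] % P
    return enc(len(cover), prod)

eB = encode_set({1, 4})
eD = encode_set({3, 5, 6})
eF = encode_set({2, 7})
e_all = encode_set({1, 2, 3, 4, 5, 6, 7})  # published iff b = 0

combined = mul(mul(eB, eD), eF)
assert combined == e_all  # {B, D, F} is an exact cover, so b is revealed
```

Any non-cover combination either misses some a_i or repeats one, so its product does not match e at level 7; that is also why dropping vertex 4 from B (so no exact cover exists) breaks recovery.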

Q: Why multiplication instead of addition?
A: With addition, you could subtract; with multiplication, division is hard.

Q: why is this more general than previous scheme?
A: Now when I encrypt, I don't know the key! I just know the key has certain properties
Q2: YES, but how does this enable you to do interesting things?
A: That's complicated...

Future work:
1. What (minimal) computational assumptions are needed for obfuscation?
2. How to make obfuscation more efficient? (Now it is very inefficient). Might be hard, but we may be able to simplify for specific applications.  Or maybe use trusted hardware.

Q: What are your computational assumptions?
A: It is complicated...=,=
Q: How inefficient?
A: Polynomial. power 7 or 8, I don't even want to mention the constant factors...



   

Friday, April 4, 2014

Strata: High-Performance Scalable Storage on Virtualized Non-volatile Memory

Dutch Mayer, Coho Data

(An extension to the FAST'14 talk)
Actually, I think it is the most SDS-like work I have ever heard of. They deal with the SDN-storage interaction too.


Goal: take high-end flash memory, add all the enterprise features, and sustain the high-end performance as much as possible (sustaining all of it is impossible: the flash is so fast that everything you put on top is problematic)

PCI-based Flash: a first step toward a real use of non-volatile memory
Problem:
1. It is really fast, and if you put it in a system, something will break and you won't get the raw performance.
   -- prioritize device utilization
2. Hardware evolves really quickly, so need to abstract it in some way
    -- virtualize, scale out
3. Have to play well with others (dealing with enterprise, you have to support old protocols, various hardware, different application domains, etc...)

Ideas:
high-end flash now looks a lot like CPUs: fast, expensive, mostly idle. So we want to virtualize it the way we virtualize CPUs.
Actually, flash is frequently bottlenecked on the CPU (if you add more cores, you might get an 80% performance boost). So you really need to balance the CPU, the network, and the flash.


architecture 1. virtualization:
each flash device is organized as a log-structured file system with update records
records form objects, and objects are organized in a B+-tree style, eventually one big dumb address space

architecture 2. flexibility
inspired by Click: packets (requests) pass through a pipeline (data paths); at each stage of the pipeline, you mutate the request (or pass it on...). That's how they handle replication, striping, etc. (This is kind of SDS-like, isn't it...)

load balancing with data paths: just by chaining data paths
from data paths to protocols: a library that binds to different kinds of front ends (NFS, FUSE, MySQL). They also allow you to push things down into the data path...

They ship with an SDN switch... (and a lot of other stuff that their box works with...)

Some technical problems:
1. support NFSv3 in a scale-out architecture (they use SDN scaling: the switch acts as a load balancer by pushing down rules; they are limited by the size of the rule tables, though)
2. remote replication in mutable trees

Future work:
1. they focused on utilization, but QoS is important too
2. tiering and prefetch
3. next year's flash is 2x better

Q&A:
Q: How do you analyze where you bottleneck? Where do you bottleneck?
A: The B-tree, uncontested locks (because of cache coherence), and almost everything. Doing the analysis is really hard; instrumentation and tracing help a lot. Then we come up with a theory and try to test it.









Wednesday, April 2, 2014

Google glasses, memory and computational neurology

FIXME:
write it...
(especially how brain and other facilities interact)

Thursday, March 27, 2014

A new class of bugs

classify bugs, then build an ecosystem for a bug class
see also: 
          Toward Optimization-Safe Systems: Analyzing the Impact of Undefined Behavior (SOSP'13 best paper)
           http://yangsuli.blogspot.com/2013/11/sosp13-papers.html


e.g.:
buffer overflow:
tools: Purify, Valgrind
Systems: ASLR
Languages: Java

A new class: unstable code
state of the art: turn off optimizations if seeing weird bugs

How prevalent?
How to think about it?
How to detect?

undefined behavior leads to unstable code
example: if (buf + off < buf) -- gcc translates this to if (false), because pointer overflow behavior is undefined
also: signed integer overflow, oversized shifts, null pointer dereference, absolute value overflow
code may or may not work

How prevalent?
all major compilers, all major languages (they have a nice graph in the paper) silently discard unstable code (in different ways), so changing or upgrading the compiler may lead to a broken system
more optimization over time (which makes problems worse)

How to think about it?
first approach: consider unstable code as dead code
problem: restricted to one particular compiler, and lots of false warnings, because compilers always kill some code

second approach: as bug patterns
problem: like antivirus software, inherently incomplete

third approach: as undefined behavior
problem: too many false positives

Final approach:
cause: disagreement on the spec (undefined behavior)
Formulation overview:
the disagreement, delta: the compiler assumes the program never invokes undefined behavior
what can be done only with delta: kill unstable code

Two-round simulation: first without delta, second with delta; see if the answer differs.
i.e., if the SAT oracle first answers N (the code cannot be eliminated) and then answers Y with delta, it is unstable code.

Formulate delta:
reach(e, in): expression e is reached on input in
undef(e, in): e invokes undefined behavior on input in
delta(in) = for every e: reach(e, in) -> not undef(e, in)
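The formulation can be illustrated by brute force on a tiny example. This is a toy model: STACK uses a solver over whole functions, and `wrap8` here stands in for what 8-bit hardware actually does on signed overflow. The branch condition `x + 1 < x` is live in general but dead under delta, i.e., unstable:

```python
def wrap8(v):
    """Hardware behavior for 8-bit signed ints: wrap around."""
    return (v + 128) % 256 - 128

def cond(x):
    """The branch condition `x + 1 < x` as actually executed."""
    return wrap8(x + 1) < x

def undef(x):
    """Signed overflow is undefined behavior in C."""
    return x + 1 > 127

inputs = range(-128, 128)

# Query 1 (no delta): is the branch dead over ALL inputs?
dead_without_delta = not any(cond(x) for x in inputs)
# Query 2 (with delta): dead over inputs that trigger no undefined behavior?
dead_with_delta = not any(cond(x) for x in inputs if not undef(x))

# Live in general, dead under delta => unstable code: the compiler is
# entitled to silently drop this overflow check.
assert not dead_without_delta and dead_with_delta
```

Only x = 127 makes the condition true, and that is exactly the undefined-behavior input delta excludes, so the two oracle answers disagree.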

SAT oracle: (the Boolector solver)
false negative or positive:
1. phase 1 not powerful enough -> false errors
2. phase 2 not powerful enough -> false negatives


How to detect:
STACK: an unstable-code checker
challenge: the formulation above requires inspecting the entire program, which would lead to a gigantic boolean predicate
solution: work per function, with approximation

Summary:
unstable code leads to subtle bugs
language designers: be cautious about undefined behavior
compiler writers: use our techniques to generate better warnings
programmers: check your code using STACK...

Research approach: (I like this!)
identify systems problems
derive general solutions
build practical systems/tools
e.g. GUI not-responding (as a graph reachability problem)
       replay (as a graph cut problem)
killing bugs: theorem proving (CompCert: formally verified C compiler, CACM'09; seL4: formally verified OS kernel, but that's just 8,700 LOC)
            identify subcomponents in the Linux kernel to verify


Q&A:
Q:Is it because C language just makes some decisions? 
A: It is more a question of whether you could design a standard that doesn't have this problem. C developers actually have incentives to keep signed integer overflow undefined. If you want to define buffer overflow, you have to instrument every memory write; then you need a GC, and you have just turned C into Java...
Q: Is missing code the only result of undefined behavior? Why did you choose to focus on code discarding?
A: An interesting direction to pursue... We focused on this because that's what got our attention.
Q: Do higher-level languages have less undefined behavior? Is it because of the nature of the language being high-level, or is it just newer and thus better?
A: Yes. And unstable code is (kind of) OK at the application level, less so in system-level code.
Q: What about other undefined behavior, like function argument evaluation order and sequence points?
A: I feel it's less interesting because it can be solved by the front end of the compiler. Also, you could statically check that in your source code. There is also C-light, which has no sequence-point problems.



Friday, March 7, 2014

Greening Datacenters Through Self-Generation of Renewable Energy

(Software-defined datacenter power management. And they built a real system: hardware + software, on a small datacenter.)

Thu Nguyen,   Rutgers 

Motivation: 
Datacenters use lots of energy... Most of it is consumed by small datacenters.
Can we use renewables?

Approaches:
1. Buy renewable energy from off-site power plants (Google buys wind power from NextEra) --- not stable enough, and there is transmission loss.
2. Self-generation and co-location (Apple built a 40MW solar array in NC) --- the location may not be ideal for both the DC and the power plant.

Trends in solar energy:
Current PV efficiency (solar energy to electricity) is ~15%, expected to grow to ~40% within 10 years.
Capacity efficiency (largely location-dependent): ~15-25%.
The installation cost of solar energy systems is decreasing over time, expected to drop by 40%-60% within 10 years (DOE data).

Main challenge: the power supply is variable, and it may not match your workload's consumption.

Idea: match the workload to the energy supply, instead of matching the energy supply to workload demands.
         (And you can use net metering to sell the power your battery can't store back to your grid electricity provider.)

Solution: 
Hardware (Parasol):
A steel structure on the roof + backup power (battery + grid) + IT equipment + cooling
(Actually they don't need that much cooling, even though it is really energy-consuming.)
Software:
Use Hadoop workload scheduling with predicted energy availability (based on weather forecasts). Schedule jobs on renewable energy first, then on low-price grid electricity, then on peak-price electricity, as long as job completion deadlines can still be met.
Also, make sure not to draw a lot of power from the grid at any particular time (high peak usage adds to the bill!).
GreenSwitch: keeps track of jobs, servers on/off, battery lifetime management, and peak-time and total power consumption, then outputs an energy-source schedule and a workload schedule.
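The policy can be caricatured in a few lines. This is not GreenSwitch's actual algorithm (which handles per-job deadlines, battery charging, and net metering); it only shows the priority order for fully deferrable work: serve from solar first, and defer what would otherwise run on peak-price grid power. All names are invented.

```python
def plan(demand, solar, peak_hours):
    """Per-hour grid draw for fully deferrable work.

    demand, solar: lists of server-hours per hour; peak_hours: set of
    hour indices with peak grid pricing. Solar serves demand first;
    any remainder landing in a peak hour is pushed to the next hour,
    otherwise it is bought from the grid.
    """
    grid, carry = [], 0.0
    for h, (d, s) in enumerate(zip(demand, solar)):
        need = d + carry
        leftover = max(0.0, need - s)   # what solar cannot cover
        if h in peak_hours and h + 1 < len(demand):
            carry, leftover = leftover, 0.0  # defer out of the peak hour
        else:
            carry = 0.0
        grid.append(leftover)
    return grid
```

For example, with one unit of demand per hour, solar only in hour 0, and hour 1 at peak price, all grid draw shifts to hour 2.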

Results:
Parasol (using solar energy) without the software: 60% cost savings, amortized in 7 years.
Parasol + GreenSwitch: 75% cost savings, but the batteries are too expensive to amortize...
Parasol + GreenSwitch + deferrable workloads: 96% cost savings; solar + battery amortized in 7.6 years.
(But they did the experiment in summer, so this might be too optimistic...)

Aside:
They have a year of datacenter energy profile data (for every minute: the power efficiency at each inverter, the temperature, the humidity, etc.)


Wednesday, February 19, 2014

Argon: performance insulation for shared storage servers

Approach:
1. time-slice based
2. increase prefetch/write-back sizes to insulate streaming efficiency from the disk seeks introduced by competing workloads
3. cache partitioning

Interference causes:
Disk interference, cache interference

Metric:
the fraction of the throughput it would achieve if it had the storage system to itself, within its share of the service (the R-value)
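One plausible reading of this metric as a formula (my interpretation, not necessarily the paper's exact definition): with share f and standalone throughput T_alone, R = T_shared / (f * T_alone).

```python
def r_value(throughput_shared, throughput_alone, share):
    """Insulation metric: the fraction of its fair share of standalone
    throughput a workload actually achieves while sharing the server.
    R close to 1.0 means good insulation."""
    return throughput_shared / (throughput_alone * share)
```

E.g., a workload entitled to half the server that gets 40 MB/s against 100 MB/s standalone has R = 40 / (100 * 0.5) = 0.8.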




Wednesday, February 5, 2014

Some disk accounting related stuff

Trading Capacity for Performance in a Disk Array (OSDI'00)
Takeaway:
1. One can derive analytic models of disk performance (latency, throughput, etc.) from disk and workload characteristics (rotational time, seek time, busyness of the system/queue size, read/write ratio, seek locality) and from how data is placed (striping, replication, etc.).
2. These models could be used to guide scheduling, and maybe accounting, by actually predicting what's going to happen when serving these requests.
3. Current head position, physical track layout etc., could be found by issuing requests to the disk.
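In the same spirit, a back-of-envelope service-time model can be written down directly from these characteristics. This is not the paper's model; the parameter names and the sequential-fraction shortcut are my own simplifications.

```python
def avg_service_time_ms(seek_ms, rpm, transfer_mb_s, req_kb, seq_frac=0.0):
    """Rough per-request disk service time: seek + half a rotation +
    transfer, where a seq_frac fraction of requests is sequential and
    skips the positioning cost entirely."""
    rotation_ms = 60000.0 / rpm                       # one full revolution
    position = (1 - seq_frac) * (seek_ms + rotation_ms / 2)
    transfer = req_kb / 1024.0 / transfer_mb_s * 1000.0
    return position + transfer
```

E.g., a 10,000 RPM disk with 5 ms average seek and 100 MB/s transfer serving random 64 KB requests: 5 + 3 + 0.625 = 8.625 ms per request, which is the kind of prediction such models then feed into scheduling or accounting decisions.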

Robust, Portable I/O Scheduling with the Disk Mimic (ATC'03)
by Remzi et al.
Takeaway:
1. Online disk service time prediction based on a more abstract model: instead of attempting to simulate the mechanisms or components internal to the disk, simply reproduce the output as a function of the observed inputs.
2. Need to reduce the input space: most important inputs are just logical distance and request types.
3. This method accounts for most of the disk's latency, but only in a probabilistic manner. And accuracy doesn't necessarily translate into scheduling performance.
4. Like most papers from Remzi's group, it has a beautiful evaluation section where they really explore every possible configuration of their solution and understand what is going on.
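The input-output idea in takeaways 1-2 can be sketched as a table of observed mean service times keyed by (logical distance, request type); the interpolation for unseen distances, binning, and probabilistic handling from the paper are omitted, and the class name is invented.

```python
from collections import defaultdict

class ToyMimic:
    """Predict disk service time purely from observed input-output pairs,
    with no model of the disk's internals: the key insight is that the
    most important inputs are just logical distance and request type."""
    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def observe(self, distance, rtype, service_ms):
        key = (distance, rtype)
        self.total[key] += service_ms
        self.count[key] += 1

    def predict(self, distance, rtype):
        key = (distance, rtype)
        if not self.count[key]:
            return None  # unseen input; the real system would interpolate
        return self.total[key] / self.count[key]
```

A scheduler can then rank pending requests by predicted service time instead of by a hand-built disk model.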

Argon: performance insulation for shared storage servers (FAST'07)
from CMU
Takeaway:
1. The primary interference in a storage system comes from disk head time and cache space, which cannot be easily insulated.
2. They use prefetching/write-back to ensure high disk-head efficiency (which is kind of orthogonal to insulation), then cache partitioning and time slicing for insulation, with the parameters of these three techniques dynamically adjusted by analytic models.
3. They say that the trade-off between efficiency and latency is fundamental.
4. Using feedback for scheduling is difficult here because you don't know what to expect, given the performance variability across workloads, so a theoretical prediction can help.