Monday, November 3, 2014

Wrangler: Predictable and Faster Jobs using Fewer Resources

From UC Berkeley

Existing solution for stragglers:
  1. Speculative execution, but it wastes resources and/or time
Design space:
  1. LATE (OSDI'08)
  2. Mantri (OSDI'10)
  3. Dolly (NSDI'13)

Design Principles:
Identify stragglers as early as possible (to avoid wasted resources)
Schedule tasks for improved job finish time (to avoid wasted resources and time)

Architecture of Wrangler:
Master: model builder, predictive scheduler
Slaves: workers

Selecting "input features": memory, disk, run-time contention, faulty hardware
Using feature selection methods, they find that the important features vary across nodes and over time.
Why: complex task-to-node and task-to-task interactions, plus heterogeneous clusters and task requirements
Approach: use classification techniques to build the model automatically; they use an SVM (see the sketch below)
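
As a concrete illustration of this classification step, here is a minimal sketch (not the paper's implementation) of training a linear SVM on node-level resource counters to predict whether a task assigned to a node will straggle. The feature names, sample values, and labeling rule are all assumptions made for the example.

```python
# Minimal sketch (not Wrangler's actual code): train a linear SVM to predict,
# from a node's resource-usage counters at task-assignment time, whether a
# task scheduled on that node will become a straggler.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical feature vector per (task, node) sample:
# [free_memory_mb, disk_io_wait_pct, num_running_tasks, cpu_util_pct]
X_train = np.array([
    [4096,  5.0,  2, 40.0],
    [ 512, 60.0, 12, 95.0],
    [2048, 10.0,  4, 55.0],
    [ 256, 80.0, 14, 98.0],
])
# Label 1 = the task later turned out to be a straggler (e.g., it ran much
# longer than the median task of its job); 0 = it finished normally.
y_train = np.array([0, 1, 0, 1])

model = make_pipeline(StandardScaler(), LinearSVC())
model.fit(X_train, y_train)

# At scheduling time, query the model with the candidate node's current state.
node_now = np.array([[384, 70.0, 13, 97.0]])
print("predicted straggler?", bool(model.predict(node_now)[0]))
```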


Evaluation
~80% true positive and true negative rate
Question: Is this accuracy good enough?
How to answer: does it improve job completion time? Does it reduce resource consumption? The key is better load balancing.
Initial evaluation: no better load balancing
Second iteration: use a confidence measure (see the sketch below)
Final Evaluation: Reduced job completion time and reduced resource consumption.
Insight: confidence is key!
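
The confidence measure can be sketched as a threshold on the distance from the SVM decision boundary: act on a straggler prediction only when the model is confident. This continues the sketch above (reusing `model` and `node_now`); the threshold value is an assumption, not a number from the paper.

```python
# Confidence-gated scheduling sketch: only trust a "will straggle" prediction
# when the sample lies far enough from the SVM decision boundary.
CONFIDENCE_THRESHOLD = 1.0  # assumed value; would be tuned per cluster

def should_delay_assignment(model, node_features):
    score = model.decision_function(node_features)[0]  # signed distance
    # Positive score = predicted straggler; act only on confident positives.
    return score > CONFIDENCE_THRESHOLD

# Reusing `model` and `node_now` from the previous sketch:
if should_delay_assignment(model, node_now):
    print("delay this task: node confidently predicted to cause a straggler")
else:
    print("schedule normally")
```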

Another question: Sophisticated schedulers exist. Why Wrangler?
  1. It is difficult to anticipate the dynamically changing causes of stragglers
  2. It is difficult to build a generic and unbiased scheduler

Q&A:
Q: How do you differentiate stragglers caused by a poor environment from nodes that actually have more work to do?
A: That is not addressed in this work; we will look into it.
Q: How does Wrangler compare to existing techniques such as LATE and Dolly?
A: I don't have numbers for that, but we provide a mechanism (?) that sits on top of everything else.
Q: How much time do you need to train the model?
A: We keep collecting data (in a somewhat online fashion)

SOCC'14 Session 1: High Performance Data Center Operating Systems and Networks

Arrakis: An OS for the Data Center

Systems in the data center are generally I/O bound.
Today's I/O devices are fast (NICs, RAID controllers, etc.), but the OS cannot keep up with them.

Kernel responsibilities (API, naming, ACLs, protection, I/O scheduling, etc.) are too heavyweight.

Arrakis: skip the kernel and deliver I/O directly to applications, while keeping classical server OS features.

Hardware can help, because more and more functionality is embedded in hardware (SR-IOV, IOMMU, packet filters, logical disks, NIC rate limiters, etc.)

Approach: put protection, multiplexing, and I/O scheduling into the device; put the API and I/O scheduling into the application; keep naming, ACLs, and resource limiting in the kernel, since they are not on the data path. So: device + application form the data plane, and the kernel is the control plane.

Kernel: performs the ACL check once when configuring the data plane; a virtual file system handles naming
Redis (the example application): persistent data structures (log, queue, etc.)
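
To make the split concrete, here is a toy Python model (purely illustrative; the real Arrakis relies on hardware I/O virtualization such as SR-IOV, and the class and method names below are invented for the sketch). The kernel performs the ACL check once when handing out a virtualized device instance; afterwards the application issues I/O on that instance without re-entering the kernel.

```python
# Toy model of the Arrakis control-plane/data-plane split (illustrative only).

class VirtualDeviceInstance:
    """Data plane: the application owns this and does I/O without the kernel."""
    def __init__(self, owner):
        self.owner = owner
        self.queue = []

    def send(self, packet):
        self.queue.append(packet)  # direct, unmediated I/O

class ControlPlane:
    """Kernel: naming, ACLs, resource limits, checked once at setup time."""
    def __init__(self, acl):
        self.acl = acl  # e.g., {"redis": {"nic0"}}

    def open_device(self, app, device_name):
        if device_name not in self.acl.get(app, set()):
            raise PermissionError(f"{app} may not access {device_name}")
        return VirtualDeviceInstance(app)  # ACL checked once, here

kernel = ControlPlane(acl={"redis": {"nic0"}})
nic = kernel.open_device("redis", "nic0")  # control-plane path, runs once
for i in range(3):
    nic.send(f"packet-{i}")                # data-plane path, every I/O
print(len(nic.queue), "packets sent without re-entering the kernel")
```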

Results: in-memory GET latency reduced by 65%, PUT latency by 75%, 1.75x GET throughput, etc.

Implication: we are all OS developers now.
I/O hardware-application co-design
Applications need fine-grained control (in the spirit of OpenFlow): where in memory packets go, how packets are routed through cores, etc. (toy sketch below)
Application-specific storage design
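
As a toy illustration of that fine-grained control (all names and the steering policy here are invented for the sketch), an application could install its own policy for directing incoming packets to per-core receive queues:

```python
# Toy sketch: the application, not the kernel, decides which core's receive
# queue each packet lands in (here, a simple port-based steering policy).
NUM_CORES = 4
rx_queues = {core: [] for core in range(NUM_CORES)}

def steer(packet):
    core = packet["src_port"] % NUM_CORES  # app-chosen steering policy
    rx_queues[core].append(packet)

for i in range(8):
    steer({"src_port": 5000 + i, "payload": f"msg-{i}"})
print({core: len(q) for core, q in rx_queues.items()})
```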

Q&A:
Q: How does it compare with a hacked Linux kernel?
A: No specific answer. Some people have worked on hacked Linux kernels, e.g., user-level networking, or Remzi's work (?)
Q: Limitations? In particular, binding for large-scale applications?
A: Limitations are in the hardware. E.g., you can't have more than a few virtual disks on a real disk, but you can have hundreds of virtual network devices (?)



Network Subways and Rewiring:

Today's data center tension: cost vs. capacity. Above the ToR switches, average link utilization is only 25%.

Why: rack-level traffic is bursty and long-tailed

Subways: multiple ports per server
So, what do we do with the extra links?
Today: wire them to multiple core switches
Proposal: connect them to neighboring ToRs, which reduces ToR traffic and distributes load more evenly (back-of-envelope sketch below)
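
A back-of-envelope sketch of why this helps with bursty traffic (all numbers and the round-robin neighbor layout are assumed, not from the talk): spreading each rack's burst over its own ToR plus k neighboring ToRs divides the peak per-ToR load by roughly k + 1.

```python
# Assumed toy model: 8 racks, long-tailed bursts, each rack optionally shares
# load with 2 neighboring ToRs (Subways-style extra links).
import random

random.seed(0)
NUM_RACKS, NEIGHBOR_LINKS = 8, 2
bursts = [random.choice([0, 0, 0, 100]) for _ in range(NUM_RACKS)]

# Baseline: each rack's burst hits only its own ToR.
peak_baseline = max(bursts)

# Subways-style: each rack spreads its burst over itself + neighbor ToRs.
load = [0.0] * NUM_RACKS
for r, burst in enumerate(bursts):
    share = burst / (NEIGHBOR_LINKS + 1)
    for d in range(NEIGHBOR_LINKS + 1):
        load[(r + d) % NUM_RACKS] += share
peak_subways = max(load)

print(f"peak ToR load: baseline={peak_baseline}, "
      f"with neighbor links={peak_subways:.1f}")
```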

Result: up to a 2.8x performance improvement for memcached

Q&A:
Q: Wiring across racks could concern people (data center administrators)
A: We haven't talked with those people, but there is a huge performance benefit
Q: How does this change failure modes?
A: We don't know about large-scale failure modes, but we can do faster local recovery, etc.
Q: Power usage?
Q: How do competing jobs interact with your rewiring?
A: We have more flexibility