From UC-Berkeley
Solution for stragglers:
- speculative execution, but it wastes resources and/or time
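Speculative execution itself is simple to sketch: if a task runs past a deadline, launch a backup copy and take whichever finishes first. The thread-pool version below is purely illustrative (real schedulers speculate across machines, not threads), and all names are made up; note how the backup duplicates work, which is exactly the wasted-resources problem noted above.

```python
import concurrent.futures
import time

def run_with_speculation(task, timeout, executor):
    """Run `task`; if it has not finished within `timeout` seconds,
    launch a speculative backup copy and take the first result.
    (Sketch only -- not any particular scheduler's implementation.)"""
    primary = executor.submit(task)
    try:
        return primary.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        backup = executor.submit(task)  # duplicate work: the resource cost
        done, _ = concurrent.futures.wait(
            [primary, backup],
            return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()

def slow_task():
    time.sleep(0.2)  # pretend this task is straggling
    return "done"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
    print(run_with_speculation(slow_task, timeout=0.05, executor=ex))
```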
Design spaces:
- LATE (OSDI '08)
- Mantri (OSDI '10)
- Dolly (NSDI '13)
Design Principles:
- Identify stragglers as early as possible (to avoid wasted resources)
- Schedule tasks for improved job finish time (to avoid wasted resources and time)
Architecture of Wrangler:
- Master: model builder, predictive scheduler
- Slaves: workers
Selecting "input features": memory, disk, run-time contention, faulty hardware
Using feature selection methods: the features that matter vary across nodes and across time.
Why: complex task-to-node and task-to-task interactions; heterogeneous clusters and task requirements
Approach: use classification techniques to build the model automatically; they use SVMs
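As a rough stand-in for Wrangler's per-node SVM, the sketch below trains a plain linear perceptron (not a true max-margin SVM) to flag likely stragglers from resource-usage features. The feature names and toy data are invented for illustration only.

```python
# Hedged sketch: a linear classifier over per-node resource features as a
# stand-in for Wrangler's SVM. Feature names and data are made up.
def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """samples: list of feature vectors; labels: +1 (straggler) / -1 (normal)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified -> nudge the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy data: [memory pressure, disk I/O wait] -- purely illustrative.
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
print(predict(w, b, [0.85, 0.9]))  # high contention -> predicted straggler (1)
```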
Evaluation:
~80% true-positive and true-negative rates
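To make the ~80% figure concrete: true-positive rate is recall on actual stragglers, true-negative rate is recall on actual non-stragglers, both read off a confusion matrix. The counts below are invented, not from the paper.

```python
# Hedged sketch: computing TP/TN rates from a confusion matrix.
def rates(tp, fn, tn, fp):
    tpr = tp / (tp + fn)  # fraction of actual stragglers caught
    tnr = tn / (tn + fp)  # fraction of normal tasks correctly cleared
    return tpr, tnr

# Made-up counts chosen to land at the ~80% reported in the talk.
tpr, tnr = rates(tp=80, fn=20, tn=80, fp=20)
print(tpr, tnr)  # 0.8 0.8
```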
Question: Is this accuracy good enough?
How to answer: improved job completion time? Reduced resource consumption? The key is better load balancing.
Initial evaluation: no better load balancing
Second Iteration: use a confidence measure
Final Evaluation: reduced job completion time and reduced resource consumption.
Insight: confidence is key!
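The "confidence is key" insight might be sketched as: act on a straggler prediction only when the model is confident enough, and otherwise fall back to normal assignment so a noisy model cannot starve healthy nodes. The threshold, node names, and scores below are illustrative assumptions, not Wrangler's actual mechanism.

```python
# Hedged sketch of confidence-gated task placement (names/values invented).
def assign_task(candidate_nodes, straggler_confidence, threshold=0.7):
    """Pick a node for a task, avoiding nodes the model flags as likely
    stragglers -- but only when that prediction clears the confidence
    threshold. straggler_confidence: node -> P(node causes a straggler)."""
    safe = [n for n in candidate_nodes
            if straggler_confidence.get(n, 0.0) < threshold]
    # Fall back to any candidate rather than blocking the task forever.
    pool = safe or candidate_nodes
    return min(pool, key=lambda n: straggler_confidence.get(n, 0.0))

conf = {"node-a": 0.9, "node-b": 0.55, "node-c": 0.1}
print(assign_task(["node-a", "node-b", "node-c"], conf))  # node-c
```

Only node-a is confidently predicted to straggle, so it is skipped; node-b's mid-range score is below the threshold and is therefore not acted on.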
Another question: sophisticated schedulers exist; why Wrangler?
- Difficult to anticipate the dynamically changing causes
- Difficult to build a generic and unbiased scheduler
Q&A:
Q: How do you differentiate stragglers caused by a poor environment from nodes that actually have more work to do?
A: That is not addressed in this work; we will look into it.
Q: How does Wrangler compare to existing techniques such as LATE and Dolly?
A: I don't have numbers for that, but we provide a mechanism (?) that works on top of everything else.
Q: How much time do you need to train the model?
A: We keep collecting data (in a somewhat online fashion).