by Xiaohui Gu, North Carolina State University
Key Idea:
Use prediction: discover, say 30 seconds in advance, that something bad is about to happen. This lead time is critical for recovery.
Decouple recovery and diagnostics.
Past approaches for reliability: reactive or proactive
Replication doesn't solve, say, deterministic bugs.
How to make short-term anomaly predictions:
Performance anomalies are defined on SLO metrics (response time, etc.).
Monitor system-level metrics: CPU, memory, network, disk I/O.
In the system-level metric space, find the sub-space that triggers performance anomalies.
Feature prediction techniques are used to predict which system-level state we will reach in 30 seconds or 5 minutes (a simple Markov model works better than more complex models for short-term predictions, say 30 seconds).
Statistical learning techniques are used to learn, for a given system state, the probability of a performance anomaly in that state (see the sketch below).
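A minimal sketch of how these two pieces could fit together, assuming coarse discretization of the metric space, a first-order Markov transition model, and a Naive Bayes classifier; the metric names, thresholds, and synthetic data are my illustrative assumptions, not the speaker's actual implementation:

```python
# Sketch: Markov model over coarse system-metric states + Naive Bayes mapping
# a predicted state to an anomaly probability. All data here is synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def discretize(metrics, bins=4):
    """Map raw metric vectors (CPU, mem, net, disk in [0, 1]) to coarse state ids."""
    levels = np.clip((metrics * bins).astype(int), 0, bins - 1)
    return np.ravel_multi_index(levels.T, (bins,) * metrics.shape[1])

def fit_markov(states, n_states):
    """Estimate a first-order transition matrix from an observed state sequence."""
    T = np.ones((n_states, n_states))               # add-one smoothing
    for s, s_next in zip(states[:-1], states[1:]):
        T[s, s_next] += 1
    return T / T.sum(axis=1, keepdims=True)

# Training: a metric trace (sampled, e.g., once per second) and anomaly labels
# derived from SLO violations (both synthetic here).
rng = np.random.default_rng(0)
metrics = rng.random((2000, 4))                     # CPU, memory, network, disk I/O
labels = (metrics[:, 1] > 0.9).astype(int)          # pretend high memory => SLO violation

states = discretize(metrics)
n_states = 4 ** metrics.shape[1]
T = fit_markov(states, n_states)
clf = GaussianNB().fit(metrics, labels)             # learns P(anomaly | metrics)

# Prediction: k steps ahead (e.g., k = 30 for a 30 s lead time at 1 Hz sampling),
# then score the most likely future state with the classifier.
k = 30
dist = np.zeros(n_states)
dist[states[-1]] = 1.0
dist = dist @ np.linalg.matrix_power(T, k)          # state distribution after k steps
future_state = dist.argmax()
mask = states == future_state
rep = metrics[mask].mean(axis=0) if mask.any() else metrics[-1]
p_anomaly = clf.predict_proba([rep])[0, 1]
print(f"Predicted anomaly probability in {k} steps: {p_anomaly:.2f}")
```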
How to do training:
Need to use failure data.
Could use unsupervised learning instead of supervised learning.
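As an illustration of the unsupervised alternative (my example, not the speaker's method): train an outlier detector on mostly-normal metric traces, so labeled failure data is not required.

```python
# Sketch: unsupervised anomaly detection when labeled failure data is scarce.
# IsolationForest stands in for whatever method the system actually uses.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal_trace = rng.normal(0.4, 0.1, size=(5000, 4))        # CPU, mem, net, disk
detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_trace)

suspect = np.array([[0.45, 0.98, 0.40, 0.42]])              # memory spike
print(detector.predict(suspect))                            # -1 => flagged as anomalous
```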
Short-term results:
For memory-leak bugs: 90% prediction accuracy with Markov + Naive Bayes, 10% false alarms.
For disk failures: 80%-90% accuracy, 10% false alarms.
Training time: on the order of milliseconds. Prediction time: on the order of microseconds.
How to do medium-term prediction:
Use a wavelet-based transform on the time-series signals.
Then combine multiple short-term predictions into a medium-term prediction (see the sketch below).
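A sketch of the medium-term idea as I understood it: wavelet-denoise the metric signal to expose its coarse trend, then roll a short-term predictor forward repeatedly until the medium-term horizon is reached. The denoising scheme and the toy trend predictor are my assumptions, not the speaker's actual models.

```python
# Sketch: wavelet denoising + chained short-term predictions for a 5-minute horizon.
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=3):
    """Zero out the detail coefficients to keep only the coarse trend."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    coeffs[1:] = [np.zeros_like(c) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

def short_term_step(history, window=30):
    """Toy one-step predictor: extrapolate the recent linear trend."""
    recent = history[-window:]
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    return history[-1] + slope

# Synthetic metric trace: slow upward drift (e.g., a leak) plus noise.
rng = np.random.default_rng(0)
t = np.arange(1800)                                  # 30 minutes at 1 Hz
trace = 0.3 + 0.0002 * t + rng.normal(0, 0.02, t.size)

trend = wavelet_denoise(trace)
horizon = 300                                        # 5-minute medium-term horizon
forecast = list(trend)
for _ in range(horizon):                             # chain short-term predictions
    forecast.append(short_term_step(np.asarray(forecast)))

print(f"Current value: {trend[-1]:.3f}, predicted in 5 min: {forecast[-1]:.3f}")
```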
Propagation-based Fault Localization:
For distributed-system-related bugs? I didn't fully follow this part.
Predictive Anomaly Prevention:
1. Learn to attribute the predicted failure to a particular metric: memory, CPU, or disk?
2. Use resource scaling or VM migration to recover automatically (see the sketch below).
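A small sketch of the prevention step: attribute a predicted anomaly to the metric that deviates most from its baseline, then pick a recovery action. The attribution rule and the action table are my illustrative assumptions.

```python
# Sketch: attribute a predicted anomaly to a culprit metric and choose an action.
from dataclasses import dataclass

@dataclass
class VMState:
    name: str
    metrics: dict            # current utilization, 0..1
    baseline: dict           # learned normal utilization, 0..1

ACTIONS = {
    "memory": "scale up memory allocation",
    "cpu": "scale up CPU shares",
    "disk": "migrate VM to a host with healthy disks",
}

def plan_recovery(vm: VMState) -> str:
    # Attribute the predicted anomaly to the metric deviating most from baseline.
    culprit = max(vm.metrics, key=lambda m: vm.metrics[m] - vm.baseline[m])
    return f"{vm.name}: {ACTIONS.get(culprit, 'alert operator')} (culprit: {culprit})"

vm = VMState("vm-17",
             metrics={"memory": 0.95, "cpu": 0.40, "disk": 0.30},
             baseline={"memory": 0.50, "cpu": 0.45, "disk": 0.25})
print(plan_recovery(vm))     # -> scale up memory allocation (culprit: memory)
```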
Efficient/Green Cloud Computing:
One can also use the above prediction techniques for resource-demand prediction and elastic resource scaling (a small sketch follows).
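A minimal sketch of reusing the predictor for green/elastic provisioning: predict near-future demand and keep just enough capacity (plus headroom) powered on. The headroom factor and per-server capacity are my assumptions.

```python
# Sketch: turn a demand forecast into a provisioning decision.
import math

def servers_needed(predicted_demand_rps: float,
                   per_server_capacity_rps: float = 500.0,
                   headroom: float = 1.2) -> int:
    """Minimum servers to keep on for the predicted load plus safety headroom."""
    return max(1, math.ceil(predicted_demand_rps * headroom / per_server_capacity_rps))

# e.g., a 5-minute-ahead demand forecast of 3,200 requests/second
print(servers_needed(3200))   # -> 8
```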
Questions:
Selective replication by Remzi's group?
Why two steps, not one step?
Is the learning set for failures limited to small samples?
The wavelet ideas?
Feedback in medium-term predictions?