Friday, February 8, 2013

HARDFS (selective 2-versioning on HDFS)

Common solutions for dealing with fail-silent failures:

1. using a replicated state machine
2. N-version programming


Main idea of HARDFS (selective 2-versioning):

0. Better to crash than to lie! So keep watching, and whenever a node does something wrong, either recover it or just kill it.
0. Make use of the fact that the system is already robust and able to recover from a lot of failures (e.g., crashes).
1. Selective (only replicate important state).
2. Use Bloom filters to compactly encode state (i.e., all file system state is encoded in terms of yes-or-no questions) -- they use a particular kind of Bloom filter which supports deletion.
3. Ask-then-check for non-boolean verification???
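
To make point 2 concrete, here is a minimal sketch of a Bloom filter that supports deletion. The notes only say "a particular kind of Bloom filter which supports deletion"; a counting Bloom filter is assumed here as one well-known variant, and it may not be the one HARDFS uses. The class name, sizes, and the fact-string format are all made up for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with per-slot counters, so facts can also be removed.

    Assumption: this is a generic counting Bloom filter, not necessarily
    the variant used by HARDFS.
    """

    def __init__(self, num_slots=1024, num_hashes=4):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counters = [0] * num_slots

    def _slots(self, fact):
        # Derive num_hashes slot indexes by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{fact}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_slots

    def add(self, fact):
        for slot in self._slots(fact):
            self.counters[slot] += 1

    def remove(self, fact):
        # Refuse to remove a fact the filter does not report as present,
        # otherwise counters could go negative.
        if fact not in self:
            raise KeyError(fact)
        for slot in self._slots(fact):
            self.counters[slot] -= 1

    def __contains__(self, fact):
        # "Yes" may be a false positive; "no" is always correct.
        return all(self.counters[slot] > 0 for slot in self._slots(fact))


# Each piece of state becomes a yes-or-no question (hypothetical format).
state = CountingBloomFilter()
fact = "block 42 belongs to /user/a.txt"
state.add(fact)
print(fact in state)   # True
state.remove(fact)
print(fact in state)   # False
```

The counters cost more space than single bits, which is the usual price for being able to delete.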

Evaluation of HARDFS:
1. Detects bit-flip errors pretty well; more crashes because of more bookkeeping (better to crash than to lie!)
     (how well it does on more realistic/correlated errors is still unknown -- but he did run experiments which show it protects against bugs from Mozilla bug reports?)
2. 3% space overhead, 8% performance overhead, 12% additional code -- because only part of the state is replicated and only part of the code is 2-versioned.




More details on HARDFS:


selected part:
harden namespace management, 
harden replica management
harden read/write protocol

micro-recovery



the second version watches input/output

node behavior model

state manager:

the state manager replicates a subset of the state (this requires understanding HDFS semantics)

use Bloom filters because they do boolean verification well

to update Bloom filter state: ask the main version for the values and check them against the 2nd version's Bloom filter

keep actively modified state in concrete form to enable in-place updates -- to avoid CPU overhead

a false positive in the Bloom filter only results in an unnecessary recovery (as long as faults are transient and non-deterministic)
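
The ask-then-check update above can be sketched roughly as follows. This is my reading of the idea, not code from HARDFS: the second version cannot answer "what is the value?", only "is it this value?", so it asks the main version and checks the answer. A plain Python set stands in for the deletable Bloom filter (so there are no false positives here), and `ShadowState`, the key names, and the values are all hypothetical.

```python
class ShadowState:
    """Second version's view: membership-only facts of the form 'key=value'."""

    def __init__(self):
        # Assumption: a plain set stands in for the deletable Bloom filter.
        self.facts = set()

    def record(self, key, value):
        self.facts.add(f"{key}={value}")

    def check(self, key, value):
        # Boolean verification: "is key currently equal to value?"
        return f"{key}={value}" in self.facts

    def update(self, key, claimed_old, new):
        # Ask-then-check: the main version supplies the old value (ask),
        # and the update only goes through if the shadow confirms it (check).
        if not self.check(key, claimed_old):
            return False
        self.facts.discard(f"{key}={claimed_old}")
        self.facts.add(f"{key}={new}")
        return True


key = "/user/a.txt:length"
main = {key: 100}              # main version keeps concrete values
shadow = ShadowState()
shadow.record(key, 100)

# Normal update: ask main for the old value, check it, then apply.
if shadow.update(key, main[key], 164):
    main[key] = 164

# Simulated memory corruption in the main version.
main[key] = 999
print(shadow.check(key, main[key]))   # False: the versions disagree
```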


action verifier:

four types of wrong actions:
        corrupt
        missing
        orphan
        out of order
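
A small sketch of how the four categories could be told apart. The notes only name the categories, so the definitions used here are my guess: the second version produces the list of actions it expects, the main version produces the actions it actually took, and the two are compared. The function name and the `(action_id, payload)` shape are hypothetical, and action ids are assumed to be unique within each list.

```python
def verify_actions(expected, actual):
    """Compare two lists of (action_id, payload) pairs.

    Returns a dict mapping each category of wrong action to the ids involved.
    Assumes action ids are unique within each list.
    """
    exp = dict(expected)
    act = dict(actual)
    report = {"corrupt": [], "missing": [], "orphan": [], "out_of_order": []}

    for action_id, payload in expected:
        if action_id not in act:
            report["missing"].append(action_id)      # expected, never taken
        elif act[action_id] != payload:
            report["corrupt"].append(action_id)      # taken, wrong content

    for action_id, _ in actual:
        if action_id not in exp:
            report["orphan"].append(action_id)       # taken, never expected

    # Among actions present on both sides, compare the relative order.
    common_expected = [a for a, _ in expected if a in act]
    common_actual = [a for a, _ in actual if a in exp]
    for want, got in zip(common_expected, common_actual):
        if want != got:
            report["out_of_order"].append(got)
    return report


expected = [("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
actual = [("a", "x"), ("c", "z"), ("b", "BAD"), ("e", "q")]
print(verify_actions(expected, actual))
# {'corrupt': ['b'], 'missing': ['d'], 'orphan': ['e'], 'out_of_order': ['c', 'b']}
```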

Handling disagreement:

        using domain knowledge to ignore false alarms
        for true alarms, I think they just restart the system using on-disk state and other nodes' state
        
Recovery:
        1. crash and reboot (expensive)
        2. micro-recovery (pinpoint the corrupted state by comparing the states of the two versions, then reconstruct only the corrupted state from disk)
                however, this needs to remove the corrupted state from the Bloom filters
                solution: start over with a new Bloom filter and add all the correct state to it

        3. thwarting destructive instructions
               ???? the master just tells the node ?????
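
The micro-recovery option (2) above can be sketched like this, under my own assumptions about the data shapes: the main version's state is a dict, the second version's state is a Bloom filter over "key=value" facts, and the on-disk copy is trusted. `BloomFilter`, `micro_recover`, and the example keys are hypothetical names; the real HARDFS code surely looks different.

```python
import hashlib


class BloomFilter:
    """Plain (non-deletable) Bloom filter, kept minimal for the sketch."""

    def __init__(self, num_bits=4096, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, fact):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{fact}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fact):
        for pos in self._positions(fact):
            self.bits[pos] = 1

    def __contains__(self, fact):
        return all(self.bits[pos] for pos in self._positions(fact))


def micro_recover(main_state, shadow, on_disk):
    """Repair only the entries on which the two versions disagree."""
    # 1. Pinpoint: keys whose in-memory value the second version does not confirm.
    suspect = [k for k, v in main_state.items() if f"{k}={v}" not in shadow]
    # 2. Reconstruct only those keys from the on-disk copy.
    repaired = dict(main_state)
    for k in suspect:
        repaired[k] = on_disk[k]
    # 3. Start over with a new filter and add all the correct state to it,
    #    instead of trying to delete the corrupted entries from the old one.
    fresh = BloomFilter(shadow.num_bits, shadow.num_hashes)
    for k, v in repaired.items():
        fresh.add(f"{k}={v}")
    return repaired, fresh, suspect


on_disk = {"blk_1:replicas": 3, "blk_2:replicas": 3}
shadow = BloomFilter()
for k, v in on_disk.items():
    shadow.add(f"{k}={v}")

main_state = {"blk_1:replicas": 3, "blk_2:replicas": 77}   # blk_2 corrupted in memory
repaired, shadow, suspect = micro_recover(main_state, shadow, on_disk)
print(suspect, repaired)
```

Rebuilding the filter costs a pass over the whole state, but it avoids having to know exactly which corrupted facts were inserted into the old one.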



       
               




























