NEAT – Network-Partitioning Fault Tolerance


We conducted a comprehensive study of 136 system failures caused by network-partitioning faults in 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects such as data loss, broken locks, the reappearance of deleted data, and system crashes. Most of the failures manifest easily once a network partition happens: they require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases one needs to consider to find a failure is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms.

We found that the majority of the failures could have been avoided by design and code reviews, and could have been discovered by simple testing if testers could inject network faults. Following our findings, we built NEAT, a testing framework that simplifies coordinating multiple clients and can inject different types of network-partitioning faults.
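
As a rough illustration of the kind of test described above, the following Python sketch isolates a single node of a toy in-memory replicated store and checks what a client connected to that node observes. It is not NEAT's API: every name here (`ToyNetwork`, `ToyCluster`, `run_partition_test`) is hypothetical, the "network" is only a set of blocked node pairs, and the store deliberately omits quorums and repair so that the fault has a visible effect.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNetwork:
    """Hypothetical in-memory network: records which node pairs are cut off."""
    blocked: set = field(default_factory=set)

    def partition(self, group_a, group_b):
        # Complete partition: no traffic in either direction between the groups.
        for a in group_a:
            for b in group_b:
                self.blocked.add(frozenset((a, b)))

    def heal(self):
        self.blocked.clear()

    def can_reach(self, src, dst):
        return src == dst or frozenset((src, dst)) not in self.blocked


class ToyCluster:
    """Deliberately naive replicated store (an assumption for illustration):
    a write is applied to every replica the coordinator can reach, with no
    quorum check and no repair after the partition heals."""

    def __init__(self, names, network):
        self.network = network
        self.data = {name: {} for name in names}

    def write(self, via, key, value):
        for name in self.data:
            if self.network.can_reach(via, name):
                self.data[name][key] = value

    def read(self, via, key):
        return self.data[via].get(key)


def run_partition_test():
    """Two coordinated clients, one injected fault, no timing dependence."""
    net = ToyNetwork()
    cluster = ToyCluster(["n1", "n2", "n3"], net)

    cluster.write("n1", "x", "v1")           # client 1 writes before the fault
    net.partition(["n3"], ["n1", "n2"])      # isolate a single node
    cluster.write("n1", "x", "v2")           # client 1 updates on the majority side
    during = cluster.read("n3", "x")         # client 2 reads from the isolated node
    net.heal()
    after = cluster.read("n3", "x")          # client 2 reads again after healing
    return during, after, cluster.read("n1", "x")


if __name__ == "__main__":
    print(run_partition_test())
```

In this toy setup the isolated replica keeps serving the old value even after the partition heals, which mirrors the deterministic, single-node-isolation failures described above. A real test would inject the fault at the network layer against an actual deployment rather than in memory.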




[1] An Analysis of Network-Partitioning Failures in Cloud Systems
Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, Samer Al-Kiswany
USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2018. (18% acceptance rate) [pdf][slides]