While conducting a comprehensive study of network failures in 25 modern systems, we identified partial network partitioning, an unusual network fault that causes catastrophic failures.
Partial network partitioning is a network fault that disrupts communication among some but not all nodes in a cluster. The figure shows how a partial network partition divides a cluster into three groups of nodes so that two groups (Group 1 and Group 2) are disconnected, but Group 3 can communicate with Groups 1 and 2.
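To make the definition concrete, here is a minimal sketch that classifies a cluster's connectivity from its set of working links. It is only an illustration: the function name `classify_partition`, the link representation, and the working definition (a partition is "partial" when some pair of nodes cannot talk directly but is still connected through intermediate nodes) are assumptions made for this example, not code from the study.

```python
from itertools import combinations


def classify_partition(nodes, links):
    """Return 'healthy', 'partial', or 'complete' for a cluster.

    nodes: iterable of node names (hypothetical representation).
    links: set of frozenset({a, b}) pairs that can communicate directly.
    """
    nodes = list(nodes)
    adjacency = {n: set() for n in nodes}
    for link in links:
        a, b = tuple(link)
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Pairs of nodes that cannot talk to each other directly.
    broken = [p for p in combinations(nodes, 2) if frozenset(p) not in links]
    if not broken:
        return "healthy"

    def reachable(src):
        # Depth-first search over the links that still work.
        seen, stack = {src}, [src]
        while stack:
            cur = stack.pop()
            for nxt in adjacency[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Assumed definition: the partition is partial if a disconnected pair
    # is still connected indirectly through other nodes.
    for a, b in broken:
        if b in reachable(a):
            return "partial"
    return "complete"


# One node per group, mirroring the figure: Group 1 and Group 2 are
# disconnected, but Group 3 can reach both.
figure_links = {frozenset({"g1", "g3"}), frozenset({"g2", "g3"})}
print(classify_partition(["g1", "g2", "g3"], figure_links))
```

In the figure's scenario the only broken pair is Group 1 and Group 2, and both remain reachable through Group 3, so the sketch reports a partial partition.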
We were curious to understand how these faults impact systems. Is the community aware of this fault? Are there established fault-tolerance techniques?
To answer these questions, we conducted a comprehensive study of system failures caused by partial network partitions. We analyzed 51 failure reports from 12 modern systems, each describing a failure caused by a partial network partitioning fault. We found that this fault leads to catastrophic failures such as data loss and complete system shutdown, that these failures manifest easily once a partial partition occurs, and that the majority of failures are due to design flaws. Finally, we found that all failures are reproducible using only 5 nodes.
While we did not find any discussion of this fault in the literature, we found eight popular systems (VoltDB, MapReduce, HBase, MongoDB, Elasticsearch, Mesos, LogCabin, and RabbitMQ) that implemented fault-tolerance techniques specifically to tolerate partial partitioning faults. We dissected the design of these eight systems and identified four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that the implemented fault-tolerance techniques are inadequate for modern systems: they either patch a particular mechanism or may lead to a complete cluster shutdown, even when alternative network paths exist.
Our findings motivated us to build Nifty, a transparent communication layer that masks partial network partitions. Nifty builds an overlay between nodes to detour packets around partial partitions. Our prototype evaluation with six popular systems shows that Nifty overcomes the shortcomings of current fault-tolerance approaches and effectively masks partial partitions while imposing negligible overhead.
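The idea of detouring traffic around a partial partition can be sketched as a shortest-path search over the links that still work. This is a hedged illustration only, not Nifty's implementation: the function `find_detour` and the link-list representation are hypothetical, and a real overlay would also need to discover link state and install forwarding rules.

```python
from collections import deque


def find_detour(links, src, dst):
    """Return the shortest path from src to dst over working links, or None.

    links: iterable of (a, b) pairs that can communicate directly
    (hypothetical representation chosen for this example).
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search; parent pointers let us rebuild the path.
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in sorted(adjacency.get(cur, ())):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None  # dst is unreachable: no detour exists


# Group 1 and Group 2 cannot talk directly, but both can reach Group 3.
working_links = [("g1", "g3"), ("g2", "g3")]
print(find_detour(working_links, "g1", "g2"))
```

Here the direct link between the two disconnected groups is missing, so the search returns a path that relays through the group that can still reach both sides.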
- Nifty framework for tolerating partial partitions: source code
- NEAT framework for testing with partial partitions: source code
Toward a Generic Fault Tolerance Technique for Partial Network Partitioning
Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, Samer Al-Kiswany
USENIX Symposium on Operating Systems Design and Implementation, 2020. [pdf] [slides]