Partial Network Partitioning – Waterloo Advanced Systems Lab (WASL)

While conducting a comprehensive study of network failures in 25 modern systems, we identified partial network partitioning, an unusual network fault that causes catastrophic failures.

Partial network partitioning is a network fault that disrupts communication among some but not all nodes in a cluster. The figure shows how a partial network partition divides a cluster into three groups of nodes so that two groups (Group 1 and Group 2) are disconnected, but Group 3 can communicate with Groups 1 and 2.
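The definition above can be made concrete with a small sketch. It assumes the cluster is described by a list of node names and a list of currently working bidirectional links; the function name `classify_partition` and the labels it returns are hypothetical and chosen only for illustration. A fault is labeled partial when at least one pair of nodes has lost its direct link but can still be reached through other nodes.

```python
from itertools import combinations


def classify_partition(nodes, links):
    """Label a connectivity snapshot as 'none', 'partial', or 'complete'.

    nodes: iterable of node names.
    links: iterable of (a, b) pairs that can currently communicate directly.
    This is an illustrative model, not code from any of the studied systems.
    """
    nodes = list(nodes)
    working = {frozenset(link) for link in links}
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def reachable(src):
        # Depth-first search over the working links.
        seen = {src}
        stack = [src]
        while stack:
            cur = stack.pop()
            for nxt in neighbors[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Pairs of nodes that have no direct working link.
    broken = [p for p in combinations(nodes, 2) if frozenset(p) not in working]
    if not broken:
        return "none"
    # Partial: some disconnected pair is still linked through a third party.
    if any(b in reachable(a) for a, b in broken):
        return "partial"
    return "complete"
```

With one node per group, `classify_partition(["g1", "g2", "g3"], [("g1", "g3"), ("g2", "g3")])` models the three-group scenario described above: g1 and g2 cannot talk to each other, yet both can talk to g3.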

We were curious to understand how these faults impact systems. Is the community aware of this fault? Are there established fault-tolerance techniques?

To answer these questions, we conducted a comprehensive study of system failures caused by partial network partitions. We analyzed 51 failure reports attributed to partial network partitioning faults in 12 modern systems. We found that this fault leads to catastrophic failures such as data loss and complete system shutdown, that these failures manifest easily once a partial partition occurs, and that the majority of failures are due to design flaws. Finally, we found that all failures are reproducible using only 5 nodes.

While we did not find any discussions of this fault in the literature, we found eight popular systems (VoltDB, MapReduce, HBase, MongoDB, Elasticsearch, Mesos, LogCabin, and RabbitMQ) that implemented fault-tolerance techniques specifically to tolerate partial partitioning faults. We dissected the design of these eight popular systems and identified four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that implemented fault-tolerance techniques are inadequate for modern systems; they either patch a particular mechanism, or they may lead to a complete cluster shutdown, even when alternative network paths exist.

Our findings motivated us to build Nifty, a transparent communication layer that masks partial network partitions. Nifty builds an overlay between nodes to detour packets around partial partitions. Our prototype evaluation with six popular systems shows that Nifty overcomes the shortcomings of current fault-tolerance approaches and effectively masks partial partitions while imposing negligible overhead.
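The idea of detouring around a partial partition can be illustrated with a short sketch. This is not Nifty's implementation, which is not described on this page; it only assumes that each node knows which links currently work and uses a breadth-first search to pick a shortest relay path. The function name `find_detour` and the node names are hypothetical.

```python
from collections import deque


def find_detour(src, dst, links):
    """Return a shortest path from src to dst over working links, or None.

    links: iterable of (a, b) pairs that can currently communicate directly.
    Illustrative only; a real overlay would also need to track link state.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    parent = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            # Walk the parent pointers back to src to rebuild the path.
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        # Sorted only to make the choice among equal-length paths deterministic.
        for nxt in sorted(neighbors.get(cur, ())):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None  # no working path: dst is completely partitioned from src
```

For example, if n1 and n2 have lost their direct link but both still reach n3, `find_detour("n1", "n2", [("n1", "n3"), ("n2", "n3")])` yields the relay path through n3. When no path exists, the function returns `None`, which corresponds to a complete partition that no overlay could mask.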

Downloads

Nifty framework for tolerating partial partitions: source code
NEAT framework for testing with partial partitions: source code

Publications

[1] Slicify: Fault Injection Testing for Network Partitions
Seba Khaleel*, Sreeharsha Udayashankar*, Samer Al-Kiswany
IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Oct. 2024. [pdf] https://cs.uwaterloo.ca/~alkiswan/papers/Slicify_MASCOTS24.pdf

[2] CASPR: Connectivity-Aware Scheduling for Partition Resilience
Sara Qunaibi, Sreeharsha Udayashankar, Samer Al-Kiswany
International Symposium on Reliable Distributed Systems (SRDS), 2023. [pdf]

[3] Partial Network Partitioning
Basil AlKhatib, Sreeharsha Udayashankar, Sara Qunaibi, Ahmed Alquraan, Mohammed Alfatafta, Wael Al-Manasrah, Alex Depoutovitch, Samer Al-Kiswany
ACM Transactions on Computer Systems (TOCS), 2022. [pdf]

[4] Toward a Generic Fault Tolerance Technique for Partial Network Partitioning
Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, Samer Al-Kiswany
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020. [pdf] [slides]