Disaster Recovery for Distributed Data Stores – Waterloo Advanced Systems Lab (WASL)

This project presents a new Disaster Recovery (DR) system, called Slogger, that differs from prior works in two principle ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts the continuous backup approach that strives to maintain a tiny lag on the backup site relative to the primary site,thereby restricting the data loss window, due to disasters, to milliseconds. These goals pose a significant set of challenges related to consistency of the backup site’s state, failures, and scalability. Slogger employs a combination of asynchronous log replication, intra-data center synchronized clocks, pipelining, batching, and anovel watermark service to address these challenges. Furthermore, Slogger is designed to be deployable as an “add-on” module in an existing distributed data store with few modifications to the original code base.

Our evaluation, conducted on Slogger extensions to a 32-sharded version of LogCabin, an open source key-value store, shows that Slogger maintains a very small data loss window of 14.2 milliseconds which is near the optimal value in our evaluation setup. Moreover, Slogger reduces the length of the data loss window by 50% compared to incremental snapshotting technique without having any performance penalty on the primary data store. Furthermore, our experiments demonstrate that Slogger achieves our other goals of scalability, fault tolerance, and efficient failover to the backup data store when a disaster is declared at the primary data store.

People

Publications

[1] Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores
Ahmed Alquraan, Alex Kogan, Virendra Marathe, Samer Al-Kiswany
Proc. VLDB Endowment. Sept. 2020 [pdf][slides]