LoLKV: The Logless, Linearizable, RDMA-based Key-Value Storage System
LoLKV is a novel logless replicated key-value storage system. LoLKV follows a fundamentally different approach to designing a linearizable key-value store than state-of-the-art systems: it forgoes the classical log-based design and instead uses a lock-free approach that allows multiple threads to update objects concurrently. Our evaluation shows that LoLKV achieves 1.7–10× higher throughput and 20–92% lower tail latency than state-of-the-art alternatives.
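The lock-free update path can be illustrated with a version-tagged compare-and-swap retry loop. This is a minimal Python sketch, not LoLKV's actual design: LoLKV operates on RDMA-accessible memory, and the `compare_and_set` helper here is a hypothetical stand-in for a hardware CAS.

```python
import threading

class Slot:
    """An object slot updated via compare-and-swap instead of a lock."""
    def __init__(self, value=None):
        self._state = (0, value)        # (version, value) swapped atomically
        self._guard = threading.Lock()  # models the atomicity of hardware CAS

    def load(self):
        return self._state

    def compare_and_set(self, expected, new):
        # Hypothetical stand-in for an atomic CAS on an RDMA-mapped word.
        with self._guard:
            if self._state == expected:
                self._state = new
                return True
            return False

def lockfree_update(slot, new_value):
    """Retry loop: re-read, compute, CAS; losers retry rather than block."""
    while True:
        snapshot = slot.load()
        version, _ = snapshot
        if slot.compare_and_set(snapshot, (version + 1, new_value)):
            return version + 1

slot = Slot()
threads = [threading.Thread(target=lockfree_update, args=(slot, i)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(slot.load()[0])  # every thread retries until it wins once, so version is 8
```

Because no thread ever holds a lock across the read-modify-write, a stalled thread cannot block the others; it simply loses the CAS and retries.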
Draconis: Network-Accelerated Scheduling for Microsecond-Scale Workloads
Draconis is a novel scheduler for workloads in the range of tens to hundreds of microseconds. Draconis challenges the popular belief that programmable switches cannot house the complex data structures, such as queues, needed to support an in-network scheduler. Using programmable switches, Draconis achieves the low scheduling tail latency and high throughput needed to support microsecond-scale workloads on large clusters. Furthermore, Draconis supports a wide range of complex scheduling policies, including locality-aware, priority-based, and resource-based scheduling. Draconis reduces 99th-percentile scheduling latency by 3×–200× compared to state-of-the-art software-based and network-accelerated schedulers on a range of synthetic workloads. Additionally, Draconis achieves 52× higher throughput than server-based scheduling systems.
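The key constraint is that switch data planes offer only fixed-size register arrays, not dynamic memory. A queue can still be built from such arrays as a bounded circular buffer per priority level. The sketch below illustrates this idea in Python; it is an assumption-laden analogy, not Draconis's actual P4 layout.

```python
class SwitchQueues:
    """Priority queues built only from fixed-size arrays, mimicking the
    register arrays available on a programmable switch (illustrative
    sketch; Draconis's actual P4 data structures may differ)."""
    def __init__(self, num_priorities=4, depth=64):
        self.depth = depth
        self.slots = [[None] * depth for _ in range(num_priorities)]
        self.head = [0] * num_priorities   # dequeue index per priority
        self.tail = [0] * num_priorities   # enqueue index per priority

    def enqueue(self, prio, task):
        if self.tail[prio] - self.head[prio] == self.depth:
            return False                   # queue full: drop or NACK the sender
        self.slots[prio][self.tail[prio] % self.depth] = task
        self.tail[prio] += 1
        return True

    def dequeue(self):
        # Highest priority (lowest index) wins; O(num_priorities) scan,
        # a constant amount of work per packet, as a switch pipeline requires.
        for prio in range(len(self.slots)):
            if self.head[prio] != self.tail[prio]:
                task = self.slots[prio][self.head[prio] % self.depth]
                self.head[prio] += 1
                return task
        return None

q = SwitchQueues()
q.enqueue(2, "background-task")
q.enqueue(0, "latency-critical-task")
print(q.dequeue())  # -> latency-critical-task
```

Everything here is index arithmetic over pre-allocated arrays, which is exactly the kind of operation a match-action pipeline can express.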
MECBench: A Framework for Benchmarking Multi-Access Edge Computing Platforms
MECBench is an extensible benchmarking framework for multi-access edge computing. It can emulate networks with different capabilities and conditions, scale workloads to mimic a large number of clients, and generate a range of workload patterns. MECBench can be extended to change the generated workload, use new datasets, and integrate new applications.
Survey of Message Queueing Systems
We present a comprehensive survey of open-source message queueing systems. We selected 10 popular and diverse messaging MOM systems. For each system, we examine 42 features with a total of 134 different options. We report our insights and recommendation to the community. We have also created an annotated data set which can be used to help practitioners and developers understand and compare the features of different systems.
Partial Network Partitioning
This project presents a comprehensive study of system failures caused by partial network partitions – an atypical type of network partitioning fault in which some nodes can still reach both sides of the partition. We also dissected the designs of eight popular systems and found that their implemented fault-tolerance techniques are inadequate for modern systems. Our findings motivated us to build Nifty, a transparent communication layer that masks partial network partitions.
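A partial partition can be modeled as a split into groups plus a set of bridge nodes that remain connected to both sides. The following sketch (an illustrative model, not Nifty's implementation) shows why such faults are subtle: reachability becomes asymmetric across the cluster rather than cleanly split.

```python
def partition_reachable(groups, bridges, src, dst):
    """Model of a partial network partition (illustrative sketch): nodes are
    split into groups, and traffic crosses a group boundary only if one
    endpoint is a bridge node that still sees both sides."""
    def group_of(node):
        for i, group in enumerate(groups):
            if node in group:
                return i
        raise ValueError(f"unknown node {node!r}")

    if group_of(src) == group_of(dst):
        return True                      # same side: unaffected by the fault
    return src in bridges or dst in bridges

# Complete partition: no bridges, the two sides cannot talk at all.
print(partition_reachable([{"n1", "n2"}, {"n3"}], set(), "n1", "n3"))   # False
# Partial partition: n2 bridges the two sides, so n2 <-> n3 still works
# even though n1 <-> n3 does not.
print(partition_reachable([{"n1", "n2"}, {"n3"}], {"n2"}, "n2", "n3"))  # True
```

A bridge node like `n2` could, in principle, relay traffic between the disconnected sides – which is the core idea behind a masking layer such as Nifty.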
Slogger: Scalable, Near-Zero Loss Disaster Recovery for Distributed Data Stores
Slogger is a new disaster recovery (DR) system that differs from prior work in two principal ways: (i) Slogger enables DR for a linearizable distributed data store, and (ii) Slogger adopts a continuous backup approach that strives to maintain a tiny lag on the backup site relative to the primary site, thereby restricting the data-loss window due to disasters to milliseconds. Our evaluation shows that Slogger maintains a near-optimal data-loss window without imposing any performance penalty on the primary data store.
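The contrast with periodic snapshots is that each mutation is shipped to the backup site as soon as it commits, so the data-loss window is bounded by the in-flight stream rather than a snapshot interval. A minimal Python sketch of this continuous-backup idea (class and method names are illustrative, not Slogger's API):

```python
from collections import deque

class ContinuousBackup:
    """Continuous-backup sketch: the primary streams every mutation to the
    backup as it commits, so the backup's lag is bounded by the in-flight
    window rather than by a periodic snapshot interval (illustrative only)."""
    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.inflight = deque()   # mutations shipped but not yet applied
        self.committed = 0
        self.applied = 0

    def put(self, key, value):
        self.primary[key] = value
        self.committed += 1
        self.inflight.append((key, value))   # ship immediately, not batched

    def drain_one(self):
        """Backup site applies the next shipped mutation."""
        if self.inflight:
            key, value = self.inflight.popleft()
            self.backup[key] = value
            self.applied += 1

    def lag(self):
        return self.committed - self.applied  # data-loss window, in mutations

store = ContinuousBackup()
store.put("a", 1)
store.put("b", 2)
store.drain_one()
print(store.lag())  # one mutation still in flight
```

If a disaster destroys the primary at this point, only the mutations still in `inflight` are lost – the window Slogger strives to keep tiny.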
Network-Accelerated Consensus
FLAIR is a novel protocol for accelerating read operations in leader-based consensus protocols. FLAIR leverages the capabilities of the new generation of programmable switches to serve reads from follower replicas without compromising consistency. Building on the FLAIR protocol, we designed FlairKV, a key-value store that implements the switch processing pipeline in the P4 programming language. Compared to state-of-the-art alternatives, FlairKV brings significant performance gains: up to 42% higher throughput and 35–97% lower latency for most workloads.
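The safety argument hinges on the switch knowing, per object, whether a write is still in flight: stable objects can be read from any follower, while objects with pending writes must go to the leader. The sketch below captures that routing decision; the state and names are illustrative simplifications, not FLAIR's actual packet format or switch program.

```python
class FlairStyleReadSteering:
    """Illustrative sketch of switch-based read steering: track, per object,
    whether any write is in flight. Stable objects are served by followers
    (round-robin here); unstable ones fall back to the leader."""
    def __init__(self, followers):
        self.followers = followers
        self.pending_writes = {}   # object key -> count of in-flight writes
        self.rr = 0                # round-robin cursor over followers

    def on_write_start(self, key):
        self.pending_writes[key] = self.pending_writes.get(key, 0) + 1

    def on_write_commit(self, key):
        self.pending_writes[key] -= 1
        if self.pending_writes[key] == 0:
            del self.pending_writes[key]

    def route_read(self, key):
        if key in self.pending_writes:
            return "leader"        # unstable: only the leader is guaranteed fresh
        replica = self.followers[self.rr % len(self.followers)]
        self.rr += 1
        return replica             # stable: any follower returns the same value

switch = FlairStyleReadSteering(["follower-1", "follower-2"])
switch.on_write_start("x")
print(switch.route_read("x"))   # -> leader
switch.on_write_commit("x")
print(switch.route_read("x"))   # -> follower-1
```

Because the switch sits on the path of every request, it observes writes before any replica does, which is what makes this bookkeeping possible at line rate.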
NEAT – Impact of Network Partitioning Failures on Distributed Systems
This project studies the impact of network partitioning failures on production-ready distributed systems. We analyzed 136 reported failures in 25 widely used distributed systems. In addition, we developed NEAT, a testing framework capable of injecting different types of network partitions.
NICE – Network-Integrated Cluster-Efficient Storage
NICE is a key-value storage system design that leverages new software-defined networking capabilities to build a cluster-based, network-efficient storage system. NICE presents novel techniques to co-design network routing and multicast with storage replication, consistency, and load balancing to achieve higher efficiency, performance, and scalability.
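One way co-designing multicast with replication saves network work: the client sends a single packet to a multicast group and the network fans it out to every replica, instead of the client (or a primary) sending one unicast copy per replica. A minimal sketch of that idea, with the network's duplication modeled as a loop (names and structure are illustrative, not NICE's design):

```python
class MulticastReplicator:
    """Sketch of network-assisted replication (illustrative): one send from
    the client; the 'switch' duplicates the packet to every replica that
    joined the multicast group."""
    def __init__(self):
        self.groups = {}   # multicast group id -> list of replica stores

    def join(self, group, replica):
        self.groups.setdefault(group, []).append(replica)

    def multicast_put(self, group, key, value):
        replicas = self.groups[group]
        for replica in replicas:   # models in-network packet duplication
            replica[key] = value
        return 1, len(replicas)    # packets the client sent vs copies delivered

r1, r2, r3 = {}, {}, {}
net = MulticastReplicator()
for rep in (r1, r2, r3):
    net.join("kv-group", rep)
sent, delivered = net.multicast_put("kv-group", "k", "v")
print(sent, delivered)  # 1 3: one client send, three replica copies
```

With replication factor R, the client-side cost drops from R sends to one, and the duplication happens where it is cheapest – inside the network fabric.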
COOL: A Cloud-Optimized Structure for MPI Collective Operations
COOL is a simple and generic structure for MPI collective operations. COOL enables highly efficient designs for all collective operations in the cloud. We explore a system design based on COOL that implements frequently used collective operations. Our design efficiently uses the intra-rack network while minimizing cross-rack communication, thus improving application performance and scalability. We use recent software-defined networking capabilities to build optimal network paths for I/O-intensive collective operations.
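The benefit of preferring the intra-rack network can be seen in a two-level reduce: each rack reduces locally first, so only one partial result per rack crosses the oversubscribed rack boundary. The sketch below is an illustrative analogy for this locality principle, not COOL's actual algorithm.

```python
def hierarchical_reduce(values_by_rack):
    """Two-level sum-reduce (illustrative sketch): reduce within each rack,
    then combine one partial per rack at the root. Cross-rack traffic scales
    with the number of racks, not the number of processes."""
    rack_partials = [sum(vals) for vals in values_by_rack]   # intra-rack step
    total = sum(rack_partials)              # cross-rack step: one value per rack
    hier_cross = len(values_by_rack) - 1    # partials shipped to the root's rack
    # Naive flat reduce: every process outside the root's rack (rack 0 here)
    # sends its value across the rack boundary.
    flat_cross = sum(len(vals) for vals in values_by_rack[1:])
    return total, hier_cross, flat_cross

total, hier_cross, flat_cross = hierarchical_reduce([[1, 2, 3], [4, 5], [6]])
print(total, hier_cross, flat_cross)  # 21 2 3
```

Even in this tiny six-process example the cross-rack message count drops from 3 to 2; at cluster scale the gap grows with processes per rack.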