Techniques for efficient and fault-tolerant geo-replication

Liu, Shengyun

Modern internet applications replicate their service across globally deployed data centers to serve their clients reliably and efficiently. When confronted with failures such as machine crashes or data center outages, many applications rely on a State Machine Replication (SMR) protocol to keep the service reliable.

However, since the performance of network connections among remotely located data centers is limited and non-uniform, existing SMR protocols are not well suited to this environment. Existing work typically targets one of two fault models: (1) the crash-fault model (Crash-Fault Tolerance, CFT), in which machines can halt, i.e., stop processing requests; or (2) the Byzantine-fault model (Byzantine Fault Tolerance, BFT), in which machines can behave arbitrarily, i.e., do not follow the protocol faithfully.
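To make the two fault models concrete, the sketch below computes the replica and quorum counts commonly associated with each under asynchrony: 2f+1 replicas for CFT protocols such as Paxos, and 3f+1 replicas for classic BFT protocols such as PBFT. The function names are hypothetical and are not taken from the thesis.

```python
def min_replicas(f: int, model: str) -> int:
    """Minimum number of replicas needed to tolerate f faulty machines.

    Assumes an asynchronous network and the classic bounds:
    2f+1 for crash faults (CFT), 3f+1 for Byzantine faults (BFT).
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if model == "CFT":
        return 2 * f + 1
    if model == "BFT":
        return 3 * f + 1
    raise ValueError(f"unknown fault model: {model!r}")


def quorum_size(f: int, model: str) -> int:
    """Quorum size used with min_replicas(f, model).

    CFT: a majority (f+1 of 2f+1). BFT: 2f+1 of 3f+1, so that any two
    quorums intersect in at least one correct replica.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if model == "CFT":
        return f + 1
    if model == "BFT":
        return 2 * f + 1
    raise ValueError(f"unknown fault model: {model!r}")


if __name__ == "__main__":
    for f in (1, 2):
        for model in ("CFT", "BFT"):
            print(model, "f =", f,
                  "replicas =", min_replicas(f, model),
                  "quorum =", quorum_size(f, model))
```

The extra f replicas that classic BFT requires are one source of the cost gap between the two models, which matters most when every additional replica sits in another remote data center.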

As its first contribution, this thesis introduces XPaxos, a novel SMR protocol that decouples the fault space across the machine-fault and network-fault dimensions, allowing for both Byzantine machines and network faults (i.e., communication asynchrony), yet treating them separately. XPaxos tolerates faults beyond crashes in an efficient and practical way, featuring many more nines of reliability than the celebrated crash-tolerant Paxos protocol, without increasing its resource or operation costs and while maintaining similar performance. In addition to this low cost and high performance, which match those of the best CFT protocols, we show that XPaxos always provides better availability, and sometimes (depending on the system environment) even offers strictly stronger consistency guarantees, than state-of-the-art BFT replication protocols.

In the second contribution, we argue that the "right" choice of the number of leaders in a geo-replication protocol depends on the given replica configuration, and we propose Droopy and Dripple, two orthogonal optimizations for SMR protocols, based respectively on the multi-leader approach and on request commutativity. We implement Droopy and Dripple on top of Clock-RSM, a state-of-the-art all-leader protocol. Our evaluation on Amazon EC2 shows that, under typical imbalanced workloads, Droopy-enabled Clock-RSM reduces latency compared to the native protocol and achieves latency similar to that of EPaxos, a leaderless protocol. In contrast, under balanced and non-commutative workloads, Clock-RSM with Droopy and Dripple enabled reduces latency compared to EPaxos and achieves latency similar to that of the native protocol.
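The dependence of leader choice on replica configuration and workload can be illustrated with a deliberately simplified latency model. The sketch below is not Droopy's algorithm: it assumes a single-leader, majority-quorum protocol in which a request costs one client-to-leader round trip plus one round trip from the leader to the nearest majority. The RTT matrix, the workload weights, and all function names are hypothetical.

```python
from typing import List, Sequence


def single_leader_latency(rtt: Sequence[Sequence[float]],
                          client_site: int, leader: int) -> float:
    """Estimated commit latency for a client at client_site.

    Simplified model (an assumption, not the thesis's model):
    client <-> leader round trip, plus the round trip from the leader
    to the farthest replica in its fastest majority quorum.
    """
    n = len(rtt)
    majority = n // 2 + 1
    # The leader counts as one member of the quorum, so it needs
    # acknowledgements from (majority - 1) other replicas.
    others = sorted(rtt[leader][j] for j in range(n) if j != leader)
    replication = others[majority - 2] if majority > 1 else 0.0
    return rtt[client_site][leader] + replication


def best_leader(rtt: Sequence[Sequence[float]],
                weights: Sequence[float]) -> int:
    """Site minimizing the workload-weighted average latency."""
    n = len(rtt)

    def avg(leader: int) -> float:
        return sum(weights[c] * single_leader_latency(rtt, c, leader)
                   for c in range(n))

    return min(range(n), key=avg)


# Hypothetical RTTs in milliseconds between three sites (illustrative only).
RTT: List[List[float]] = [
    [0, 80, 150],
    [80, 0, 200],
    [150, 200, 0],
]

if __name__ == "__main__":
    # Imbalanced workloads: most requests originate at one site.
    print(best_leader(RTT, [0.8, 0.1, 0.1]))
    print(best_leader(RTT, [0.1, 0.8, 0.1]))
```

Even in this toy model the preferred leader moves with the workload, which is the intuition behind adapting the leader set to the current configuration rather than fixing it in advance.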

Digital Security
© TELECOM ParisTech. Personal use of this material is permitted. The definitive version of this paper was published in and is available at :