MultiPaxos Made Complete

Zhiying Liang; Vahab Jabrayilov; Aleksey Charapko; Abutalib Aghayev

TL;DR

MultiPaxos Made Complete identifies the lack of a practical blueprint for MultiPaxos and delivers a comprehensive, reproducible design with step-by-step pseudocode for leader election, accept/commit phases, and the failure detector, plus an open-source implementation. It introduces an adaptive, lightweight log compaction that avoids snapshots by using the global trim index $gle$ (global_last_executed) to safely prune logs, and an adaptive timeout to improve resilience under partial partitions. The evaluation shows competitive throughput and latency against established SMR systems, along with reduced memory overhead during compaction and robust availability under network partitions. Together these contributions provide a practical, production-ready MultiPaxos blueprint for robust replicated state machines.

Abstract

MultiPaxos, while a fundamental Replicated State Machine algorithm, suffers from a dearth of comprehensive guidelines for achieving a complete and correct implementation. This deficiency has hindered MultiPaxos' practical utility and adoption and has resulted in flawed claims about its capabilities. Our paper aims to bridge the gap between MultiPaxos' complexity and practical implementation through a meticulous and detailed design process spanning more than a year. It carefully dissects each phase of MultiPaxos and offers detailed step-by-step pseudocode -- in addition to a complete open-source implementation -- for all components, including the leader election, the failure detector, and the commit phase. The implementation of our complete design also provides better performance stability, resource usage, and network partition tolerance than naive MultiPaxos versions. Our specification includes a lightweight log compaction approach that avoids taking repeated snapshots, significantly improving resource usage and performance stability. Our failure detector, integrated into the commit phase of the algorithm, uses variable and adaptive heartbeat intervals to settle on a better leader under partial connectivity and network partitions, improving liveness under such conditions.
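The snapshot-free compaction summarized above can be illustrated with a minimal sketch. It assumes, following the summary, that gle is the minimum of last_executed across peers and that instances at or below gle can be dropped. The `Peer` class, the dictionary-based log, and the function names are hypothetical and are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical peer: a log keyed by instance index plus last_executed."""
    last_executed: int = 0
    log: dict = field(default_factory=dict)  # instance index -> command

    def trim(self, gle: int) -> None:
        # Assumption: every peer has already executed instances up to gle,
        # so no peer needs them replayed and they can be dropped locally.
        for index in [i for i in self.log if i <= gle]:
            del self.log[index]


def global_last_executed(last_executed_values) -> int:
    # gle is the global minimum of last_executed across peers.
    return min(last_executed_values)


# Three peers that have executed different prefixes of the same log.
peers = [Peer(last_executed=5), Peer(last_executed=3), Peer(last_executed=4)]
for p in peers:
    p.log = {i: f"cmd{i}" for i in range(1, 7)}

gle = global_last_executed(p.last_executed for p in peers)
for p in peers:
    p.trim(gle)
```

In this example the slowest peer has executed up to index 3, so every peer keeps only instances 4 through 6. No snapshot is taken; the trim depends only on an index that peers would need to exchange.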

Paper Structure (21 sections, 8 figures, 1 table, 3 algorithms)

Introduction
Background
Single-Decree Paxos and MultiPaxos
Identifying the Specification Gaps
Bridging the Specification Gaps
Leader Election
Accept Phase
Commit Phase with Heartbeats
Failure Detector
Adaptive Log Compaction
Partial Network Partition
Leader-Losing-Quorum Partition
Leader Churning and Adaptive Timeout Setting
Evaluation
Throughput vs. Latency
...and 6 more sections

Figures (8)

Figure 1: Overview of the MultiPaxos Algorithm
Figure 2: An example of the Prepare (left) and Accept (right) Phase. Each peer maintains a log consisting of instances. An instance contains a ballot (denoted as 'bal') and a value (e.g., x:=1). Initially, instances are marked as In-progress when the leader adds them to the log. They transition to Committed when safe for execution. The 'New' state, while not an actual state, is used to distinguish between new instances and existing ones. Note that, in the Accept Phase, the figure only depicts the merged log instead of the actual log to save space.
Figure 3: An example of the Commit Phase. The left part depicts the logs of all peers before the leader sends a new round of heartbeats. last_executed indicates the index of the last executed instance, and gle refers to the global minimum of last_executed across peers. The right section shows log changes after the leader's heartbeats and before followers' responses. In addition to Committed and In-progress, Executed represents instances already applied to the state machine.
Figure 4: The leader-losing-quorum scenario. 1. Peers A, B, C, and D disconnect from each other but remain connected to Peer E. E still answers the heartbeats from Leader B. 2. Peers A, C, and D trigger leader elections but lack enough votes to become leader. However, a vote with a higher term number causes Peer E to ignore heartbeats from the old leader B. 3. A starts an election because it does not receive any heartbeats. 4. A becomes the leader, and availability resumes.
Figure 5: The leader-churning partition, where Peers A and C disconnect from each other. 1. Peer C times out and starts a leader election. Peer B becomes a follower of C because of C's higher ballot number. 2. Peer A learns of the new leader from B and becomes a follower, but it starts another election because it receives no heartbeats. A and C repeat leader elections, resulting in low availability due to leadership churning. 3. B starts a leader election as both A and C increase commit_interval and reduce the frequency of heartbeats. 4. Our MultiPaxos elects the stable Peer B as the leader to restore availability.
...and 3 more figures
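The Figure 5 caption says that churning peers increase commit_interval, so their heartbeats become less frequent and a stable peer ends up starting the election. The sketch below illustrates that idea only. The multiplicative backoff, the cap, the reset rule, and all names other than commit_interval are assumptions; the paper's actual adjustment rule is not given in this summary.

```python
class AdaptiveCommitInterval:
    """Hypothetical policy: slow down heartbeats after each lost leadership."""

    def __init__(self, base_ms: float = 100.0, factor: float = 2.0,
                 max_ms: float = 1600.0) -> None:
        self.base_ms = base_ms
        self.factor = factor
        self.max_ms = max_ms
        self.commit_interval_ms = base_ms

    def on_leadership_lost(self) -> None:
        # Assumed rule: a peer that keeps losing leadership backs off,
        # so peers with better connectivity time out on it first.
        self.commit_interval_ms = min(
            self.commit_interval_ms * self.factor, self.max_ms
        )

    def on_stable_leadership(self) -> None:
        # Assumed rule: return to the base interval once leadership holds.
        self.commit_interval_ms = self.base_ms


# Peers A and C churn (each loses leadership three times); B never leads yet.
a, b, c = (AdaptiveCommitInterval() for _ in range(3))
for _ in range(3):
    a.on_leadership_lost()
    c.on_leadership_lost()
```

After three rounds of churn under these assumed parameters, A and C send heartbeats every 800 ms while B would still use 100 ms. A follower of A or C with a fixed election timeout would therefore time out sooner, which is consistent with step 3 of the caption.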