MultiPaxos Made Complete
Zhiying Liang, Vahab Jabrayilov, Aleksey Charapko, Abutalib Aghayev
TL;DR
MultiPaxos Made Complete addresses the lack of a practical blueprint for MultiPaxos with a comprehensive, reproducible design: step-by-step pseudocode for leader election, the accept and commit phases, and the failure detector, plus an open-source implementation. It introduces a lightweight log compaction scheme that avoids repeated snapshots by using a global trim index, $gle$ (global_last_executed), to prune logs safely, and a failure detector with adaptive heartbeat intervals that improves liveness under partial partitions. The evaluation shows throughput and latency competitive with established state machine replication (SMR) systems, reduced memory overhead during compaction, and robust availability under network partitions. Together, these contributions provide a practical blueprint for building robust replicated state machines on MultiPaxos.
Abstract
MultiPaxos, while a fundamental Replicated State Machine algorithm, suffers from a dearth of comprehensive guidelines for achieving a complete and correct implementation. This deficiency has hindered MultiPaxos' practical utility and adoption and has resulted in flawed claims about its capabilities. Our paper aims to bridge the gap between MultiPaxos' complexity and practical implementation through a meticulous and detailed design process spanning more than a year. It carefully dissects each phase of MultiPaxos and offers detailed step-by-step pseudocode -- in addition to a complete open-source implementation -- for all components, including the leader election, the failure detector, and the commit phase. The implementation of our complete design also provides better performance stability, resource usage, and network partition tolerance than naive MultiPaxos versions. Our specification includes a lightweight log compaction approach that avoids taking repeated snapshots, significantly improving resource usage and performance stability. Our failure detector, integrated into the commit phase of the algorithm, uses variable and adaptive heartbeat intervals to settle on a better leader under partial connectivity and network partitions, improving liveness under such conditions.
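The snapshot-free compaction idea can be illustrated with a minimal sketch. It assumes the trim point (global_last_executed) is the minimum last-executed index across all replicas, so that no replica ever discards an entry another replica might still need. The `Replica` class, its fields, and `compute_global_last_executed` are hypothetical names chosen for illustration and are not taken from the paper's pseudocode or implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica with an in-memory log keyed by log index."""
    log: dict = field(default_factory=dict)  # index -> command
    last_executed: int = 0                   # highest index applied locally

    def trim(self, global_last_executed: int) -> None:
        # Drop every entry at or below the global trim index.
        stale = [i for i in self.log if i <= global_last_executed]
        for index in stale:
            del self.log[index]


def compute_global_last_executed(replicas) -> int:
    # Assumption: the safe trim point is the minimum over ALL replicas,
    # since an entry is only disposable once everyone has executed it.
    return min(r.last_executed for r in replicas)


if __name__ == "__main__":
    # Three replicas hold entries 1..5 but have executed different prefixes.
    replicas = [
        Replica(log={i: f"cmd{i}" for i in range(1, 6)}, last_executed=n)
        for n in (5, 3, 4)
    ]
    gle = compute_global_last_executed(replicas)
    for r in replicas:
        r.trim(gle)
    print(gle, [sorted(r.log) for r in replicas])
```

In this sketch the slowest replica bounds how much of the log can be pruned, which is the trade-off of trimming without snapshots: a lagging or unreachable replica holds the trim index back until it catches up.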
