Real Life Is Uncertain. Consensus Should Be Too!
Reginald Frank, Soujanya Ponnapalli, Octavio Lomeli, Neil Giridharan, Marcos K Aguilera, Natacha Crooks
TL;DR
The paper challenges the standard $f$-threshold fault model in distributed consensus, arguing that real-world faults are probabilistic, heterogeneous, and time-evolving. It introduces fault curves $p_u$ to capture per-node failure probabilities and analyzes how traditional PBFT and Raft-like protocols behave under these probabilistic faults, revealing that end-to-end safety and liveness become probabilistic guarantees rather than absolutes. The authors show that exploiting fault curves can yield cost and energy savings, and they outline opportunities for dynamic quorums, reliability-aware leader selection, and new probability-native primitives. They propose a roadmap toward probabilistic consensus that aligns better with practical notions of reliability, such as the reliability metrics used for storage systems, with potential for substantial gains in efficiency and sustainability.
Abstract
Modern distributed systems rely on consensus protocols to build a fault-tolerant core upon which they can build applications. Consensus protocols are correct under a specific failure model, where up to $f$ machines can fail. We argue that this $f$-threshold failure model oversimplifies the real world and limits opportunities to optimize for cost or performance. We argue instead for a probabilistic failure model that captures the complex and nuanced nature of faults observed in practice. Probabilistic consensus protocols can explicitly leverage individual machine \textit{failure curves} and explore side-stepping traditional bottlenecks such as majority quorum intersection, enabling systems that are more reliable, efficient, cost-effective, and sustainable.
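To make the contrast with the $f$-threshold model concrete, the following minimal sketch computes the probability that more than $f$ machines fail when each machine has its own failure probability. It is an illustration, not the authors' method: it assumes failures are independent and that each machine's failure curve has been collapsed to a single probability over some fixed time window. The function names `failure_count_distribution` and `prob_more_than_f_failures` are hypothetical.

```python
from typing import List


def failure_count_distribution(probs: List[float]) -> List[float]:
    """Return dist where dist[k] is the probability that exactly k machines fail.

    Assumption (not from the paper): machines fail independently, and
    probs[i] is machine i's failure probability over a fixed window.
    """
    dist = [1.0]  # with zero machines, zero failures is certain
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this machine stays up
            nxt[k + 1] += mass * p      # this machine fails
        dist = nxt
    return dist


def prob_more_than_f_failures(probs: List[float], f: int) -> float:
    """Probability that the number of failed machines exceeds the threshold f."""
    dist = failure_count_distribution(probs)
    return sum(dist[f + 1:])


if __name__ == "__main__":
    # Hypothetical cluster: three machines, each with a 1% failure probability.
    uniform = [0.01, 0.01, 0.01]
    print(prob_more_than_f_failures(uniform, f=1))

    # Hypothetical heterogeneous cluster: one machine is much less reliable.
    mixed = [0.001, 0.001, 0.05]
    print(prob_more_than_f_failures(mixed, f=1))
```

For the uniform three-machine cluster, the chance that the $f = 1$ assumption is violated is approximately 0.000298, small but not zero. Under these simplifying assumptions, a guarantee stated as "correct if at most $f$ machines fail" is already a probabilistic guarantee, and that probability depends on which machines are in the cluster, not only on how many.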
