Real Life Is Uncertain. Consensus Should Be Too!

Reginald Frank; Soujanya Ponnapalli; Octavio Lomeli; Neil Giridharan; Marcos K Aguilera; Natacha Crooks

TL;DR

The paper challenges the standard $f$-threshold fault model in distributed consensus, arguing that real-world faults are probabilistic, heterogeneous, and time-evolving. It introduces fault curves $p_u$ to capture per-node failure probabilities and analyzes how traditional PBFT and Raft-like protocols behave under these probabilistic faults, showing that end-to-end safety and liveness become probabilistic guarantees rather than absolutes. The authors show that exploiting fault curves can yield cost and energy savings, and they outline opportunities for dynamic quorums, reliability-aware leader selection, and new probability-native primitives. They propose a roadmap toward probabilistic consensus that aligns better with practical notions of reliability, such as the reliability metrics used for storage systems, with potential for substantial gains in efficiency and sustainability.
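
To make the contrast with the $f$-threshold model concrete, the following minimal sketch computes the probability that more than $f$ nodes fail when each node has its own failure probability. It is an illustration, not the paper's analysis: it assumes node failures are independent, treats each $p_u$ as a single number at one point in time, and uses hypothetical function names and made-up probabilities.

```python
from typing import Sequence


def failure_count_distribution(p_fail: Sequence[float]) -> list[float]:
    """Return dist where dist[k] is the probability that exactly k nodes fail.

    Assumption (not from the paper): failures are independent, so the
    count follows a Poisson binomial distribution, built here by adding
    one node at a time.
    """
    dist = [1.0]  # with zero nodes, zero failures is certain
    for p in p_fail:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this node stays correct
            nxt[k + 1] += mass * p       # this node fails
        dist = nxt
    return dist


def prob_threshold_exceeded(p_fail: Sequence[float], f: int) -> float:
    """Probability that strictly more than f nodes fail."""
    dist = failure_count_distribution(p_fail)
    return sum(dist[f + 1:])


if __name__ == "__main__":
    # Hypothetical per-node failure probabilities for a 4-node cluster.
    uniform = [0.01, 0.01, 0.01, 0.01]
    mixed = [0.001, 0.001, 0.05, 0.05]
    for name, probs in (("uniform", uniform), ("mixed", mixed)):
        print(name, prob_threshold_exceeded(probs, f=1))
```

Under these assumptions, a deployment that is "safe as long as at most $f$ nodes fail" is safe only with some probability, and that probability depends on which machines are in the cluster, not just on how many there are.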

Abstract

Modern distributed systems rely on consensus protocols to build a fault-tolerant core upon which they can build applications. Consensus protocols are correct under a specific failure model, where up to $f$ machines can fail. We argue that this $f$-threshold failure model oversimplifies the real world and limits potential opportunities to optimize for cost or performance. We argue instead for a probabilistic failure model that captures the complex and nuanced nature of faults observed in practice. Probabilistic consensus protocols can explicitly leverage individual machine failure curves and explore side-stepping traditional bottlenecks such as majority quorum intersection, enabling systems that are more reliable, efficient, cost-effective, and sustainable.

Paper Structure

This paper contains 8 sections, 2 theorems, 2 tables.

Introduction
Faults are probabilistic
Analysis of Consensus Protocols
Consensus Primer
Analysis and Key Takeaways
A Probabilistic Vision
Related Work
Conclusion

Key Result

theorem 1

PBFT is safe iff all these conditions hold:

PBFT is live iff all these conditions hold:

Theorems & Definitions (2)

theorem 1
theorem 2
