Spider: A BFT Architecture for Geo-Replicated Cloud Services

Michael Eischer, Tobias Distler

TL;DR

Spider tackles Byzantine fault tolerance in geo-distributed cloud services by decoupling agreement from execution and placing replica groups in cloud regions close to clients to minimize latency. It introduces a modular architecture in which an agreement group establishes the total order and execution groups run near clients, connected by inter-regional message channels (IRMCs) that provide built-in flow control and authentication. The approach is augmented with optimizations such as signature sharing, batching, and offloading of client verification, and is extended to support sequentially consistent reads. Empirical evaluation in public clouds shows substantially lower latency and scalable adaptability compared with traditional BFT and hierarchical designs. Overall, Spider achieves low, stable end-to-end latency while maintaining strong fault tolerance through modularity and cloud-aware design.

Abstract

Traditionally, Byzantine fault tolerance (BFT) in geo-replicated systems is achieved by executing complex agreement protocols over long-distance communication links, and therefore typically incurs high response times. In this paper we address this problem with Spider, a resilient and modular BFT replication architecture for geo-distributed systems that leverages characteristic features of today's public-cloud infrastructures to minimize both complexity and latency. Spider is composed of multiple largely independent replica groups, each of which is distributed across different availability zones of its respective cloud region. This design makes it possible to provide low response times by placing replica groups in close geographic proximity to clients, while at the same time enabling intra-group communication over short-distance links. To handle the interaction between groups that is necessary for strong consistency, Spider uses a novel message-channel abstraction with first-in-first-out semantics and built-in flow control that greatly simplifies system design.

Paper Structure (31 sections, 13 figures)

Figures (13)

  • Figure 1: BFT geo-replication architectures connecting a client (C) with leader (L) and follower (F) replicas.
  • Figure 2: Spider system architecture
  • Figure 3: Interfaces of Spider's main building blocks
  • Figure 4: Conceptual view of an example IRMC with two independent subchannels that both have a maximum capacity of ten messages (M). Senders ($S_*$) and receivers ($R_*$) access the subchannels via their local endpoints; each endpoint manages its own subchannel-specific flow-control windows.
  • Figure 5: Overview of Spider's replication protocol
  • ...and 8 more figures
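
The channel behavior described in the abstract and in the Figure 4 caption (first-in-first-out delivery, independent subchannels, a fixed maximum capacity per subchannel) can be illustrated with a small single-process model. This is only a sketch under stated assumptions, not Spider's IRMC implementation: the names `Subchannel`, `Channel`, `try_send`, and `receive` are hypothetical, and the model ignores everything that makes a real inter-regional channel hard (multiple senders and receivers per endpoint, authentication, faulty replicas, network transport).

```python
from collections import deque


class Subchannel:
    """Toy FIFO subchannel with a sliding flow-control window (hypothetical model)."""

    def __init__(self, capacity=10):
        self.capacity = capacity   # maximum number of undelivered messages
        self.window_start = 0      # lowest position not yet consumed by the receiver
        self.next_send = 0         # position the next sent message will occupy
        self.queue = deque()       # buffered messages, in send order

    def try_send(self, message):
        """Buffer a message; return False if the window is full and the sender must wait."""
        if self.next_send >= self.window_start + self.capacity:
            return False
        self.queue.append((self.next_send, message))
        self.next_send += 1
        return True

    def receive(self):
        """Deliver the oldest message (FIFO) and slide the window forward by one."""
        if not self.queue:
            return None
        position, message = self.queue.popleft()
        self.window_start = position + 1
        return message


class Channel:
    """Toy channel made of independent subchannels, each with its own window."""

    def __init__(self, subchannel_ids, capacity=10):
        self.subchannels = {sid: Subchannel(capacity) for sid in subchannel_ids}

    def try_send(self, sid, message):
        return self.subchannels[sid].try_send(message)

    def receive(self, sid):
        return self.subchannels[sid].receive()
```

In this model, a sender that fills one subchannel is refused further sends on that subchannel until the receiver consumes a message, while the other subchannel remains unaffected. That is the sense in which flow control is "built in": backpressure falls out of the window check rather than requiring a separate protocol.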