GeoShield: Byzantine Fault Detection and Recovery for Geo-Distributed Real-Time Cyber-Physical Systems

Yifan Cai, Linh Thi Xuan Phan

TL;DR

GeoShield addresses safety for geo-distributed cyber-physical systems (CPS) under Byzantine faults without relying on trusted hardware. It introduces a bounded-time recovery framework that detects inter-region faults, estimates probabilistic latency bounds with a Byzantine-resilient network measurement protocol, and uses a Timeliness Governance System (TGS) to bound malicious delays, while coordinating cross-region recovery through recovery propagation (RP). The approach combines PoCs for commission faults and RP-based recovery with intra-region Rebound-style fault tolerance, ensuring recovery, and hence safety, within the bound $D_{max}^{rec}$ and improving robustness over prior BFT methods. Case studies in railway control and smart grids, together with extensive evaluations, demonstrate substantial resource efficiency and practical viability for real-world CPS deployments.

Abstract

Large-scale cyber-physical systems (CPS), such as railway control systems and smart grids, consist of geographically distributed subsystems that are connected via unreliable, asynchronous inter-region networks. Their scale and distribution make them especially vulnerable to faults and attacks. Unfortunately, existing fault-tolerant methods either consume excessive resources or provide only eventual guarantees, making them unsuitable for real-time, resource-constrained CPS. We present GeoShield, a resource-efficient solution for defending geo-distributed CPS against Byzantine faults. GeoShield leverages the property that CPS are designed to tolerate brief disruptions and maintain safety, as long as they recover (i.e., resume normal operations or transition to a safe mode) within a bounded amount of time following a fault. Instead of masking faults, it detects them and recovers the system within bounded time, thus guaranteeing safety with far fewer resources. GeoShield introduces protocols for Byzantine fault-resilient network measurement and inter-region omission fault detection that proactively detect malicious message delays, along with recovery mechanisms that guarantee timely recovery while maximizing operational robustness. It is the first bounded-time recovery solution that operates effectively under unreliable networks without relying on trusted hardware. Evaluations using real-world case studies show that it significantly outperforms existing methods in both effectiveness and resource efficiency.

Paper Structure

This paper contains 34 sections, 6 theorems, 2 equations, 14 figures, 6 tables.

Key Result

Lemma 1

It is impossible for a node to send a valid heartbeat message more than $\Delta_{\mathsf{early}} = \Delta_{\mathsf{syn}} + \Delta_{\mathsf{intra}} + \Delta_{\mathsf{hb}}$ time units before the scheduled time.
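
As a rough illustration of how the bound in Lemma 1 might be applied, the sketch below sums the three terms and flags a heartbeat whose send time lies more than $\Delta_{\mathsf{early}}$ before its scheduled time. The `Bounds` class, the function name, the millisecond unit, and the numeric values are all hypothetical; this page does not define the individual terms, so the code treats them as opaque non-negative durations rather than reproducing GeoShield's actual check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hypothetical container for the three terms of Lemma 1 (assumed to be in ms)."""
    syn: float    # Delta_syn
    intra: float  # Delta_intra
    hb: float     # Delta_hb

    @property
    def early(self) -> float:
        # Delta_early = Delta_syn + Delta_intra + Delta_hb, as stated in Lemma 1.
        return self.syn + self.intra + self.hb


def violates_early_bound(send_time: float, scheduled_time: float, bounds: Bounds) -> bool:
    """Return True if the heartbeat was sent more than Delta_early before schedule."""
    return scheduled_time - send_time > bounds.early


# Illustrative values only; they are not taken from the paper.
bounds = Bounds(syn=2.0, intra=5.0, hb=10.0)
too_early = violates_early_bound(send_time=980.0, scheduled_time=1000.0, bounds=bounds)
in_window = violates_early_bound(send_time=990.0, scheduled_time=1000.0, bounds=bounds)
print(bounds.early, too_early, in_window)
```

With these made-up values the bound is 17 ms, so a heartbeat stamped 20 ms ahead of schedule falls outside the window that Lemma 1 allows for a valid message, while one stamped 10 ms ahead does not.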

Figures (14)

  • Figure 1: A railway control system [hollysys-brochure].
  • Figure 2: Challenges and solutions in GeoShield.
  • Figure 3: Different phases of the measurement protocol.
  • Figure 4: Mechanisms to detect and recover from different types of faults.
  • Figure 5: Inter-region task dataflow.
  • ...and 9 more figures

Theorems & Definitions (14)

  • Lemma 1: Early heartbeat
  • Theorem 1: Accuracy
  • Theorem 2: Consensus on latency
  • Theorem 3: Consensus on correctness
  • Theorem 4: TGS Properties
  • Theorem 5: BTR Guarantee
  • proof
  • proof
  • proof
  • proof
  • ...and 4 more