CSnake: Detecting Self-Sustaining Cascading Failure via Causal Stitching of Fault Propagations

Shangshu Qian, Lin Tan, Yongle Zhang

TL;DR

CSnake tackles the challenge of exposing self-sustaining cascading failures in distributed systems before deployment. Its core idea is causal stitching: it causally links multiple single-fault injections from different tests to simulate longer fault propagation chains. The links come from fault causality analysis (FCA), a counterfactual comparison of each fault injection run against its profile run. A three-phase test-budget allocation protocol (3PA) decides which fault and workload combinations to test, and a local compatibility check keeps propagations with incompatible conditions from being stitched into a cycle. CSnake detected 15 bugs that cause self-sustaining cascading failures across five open-source systems, five of which have been confirmed and two fixed, outperforming naive or fuzzing-based approaches. It is presented as a practical, JVM-based toolchain with implementation details and case studies, indicating its potential to improve pre-release resilience and fault containment in real-world deployments.

Abstract

Recent studies have revealed that self-sustaining cascading failures in distributed systems frequently lead to widespread outages, which are challenging to contain and recover from. Existing failure detection techniques struggle to expose such failures prior to deployment, because triggering them typically requires a complex combination of specific conditions. This challenge stems from the inherent nature of cascading failures: they typically involve a sequence of fault propagations, each activated by distinct conditions. This paper presents CSnake, a fault injection framework to expose self-sustaining cascading failures in distributed systems. CSnake uses the novel idea of causal stitching, which causally links multiple single-fault injections in different tests to simulate complex fault propagation chains. To identify these chains, CSnake designs a counterfactual causality analysis of fault propagations, called fault causality analysis (FCA). FCA compares the execution trace of a fault injection run with that of its corresponding profile run (i.e., the same test without the injection) and identifies any additional faults triggered, which are considered to have a causal relationship with the injected fault. To address the large search space of fault and workload combinations, CSnake employs a three-phase allocation protocol for the test budget that prioritizes faults with unique and diverse causal consequences, increasing the likelihood of uncovering conditional fault propagations. Furthermore, to avoid incorrectly connecting fault propagations from workloads with incompatible conditions, CSnake performs a low-overhead local compatibility check that approximately checks the compatibility of the path constraints associated with connected fault propagations. CSnake detected 15 bugs that cause self-sustaining cascading failures in five systems, five of which have been confirmed and two fixed.
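The two core ideas of the abstract, FCA and causal stitching, can be illustrated with a minimal Python sketch. It is not CSnake's implementation: traces are assumed to be plain lists of fault names, the fault names and the helper functions `fault_causality`, `stitch`, and `find_cycles` are hypothetical, and the three-phase budget allocation and the local compatibility check are omitted entirely.

```python
from collections import defaultdict


def fault_causality(profile_trace, injection_trace, injected_fault):
    """FCA sketch: faults seen only in the injection run are treated as
    caused by the injected fault (counterfactual comparison)."""
    baseline = set(profile_trace)
    extra = {f for f in injection_trace
             if f not in baseline and f != injected_fault}
    return {(injected_fault, f) for f in extra}


def stitch(edges):
    """Stitching sketch: merge causal edges from separate tests into one graph."""
    graph = defaultdict(set)
    for cause, effect in edges:
        graph[cause].add(effect)
    return graph


def find_cycles(graph):
    """Enumerate simple cycles; each is reported once, starting at its
    lexicographically smallest fault."""
    cycles = []

    def dfs(start, node, path):
        for nxt in sorted(graph.get(node, ())):
            if nxt == start:
                cycles.append(path + [start])
            elif nxt > start and nxt not in path:
                dfs(start, nxt, path + [nxt])

    for start in sorted(graph):
        dfs(start, start, [start])
    return cycles


# Three independent single-fault tests (all names are made up for illustration).
edges = set()
edges |= fault_causality([], ["rpc_timeout", "retry_storm"], "rpc_timeout")
edges |= fault_causality(["gc_pause"],
                         ["gc_pause", "retry_storm", "queue_overflow"],
                         "retry_storm")
edges |= fault_causality([], ["queue_overflow", "rpc_timeout"], "queue_overflow")

cycles = find_cycles(stitch(edges))
print(cycles)
```

In this toy input, `gc_pause` also appears in the profile run, so it is not attributed to the injection. The three remaining edges come from different tests, yet together they close a loop, which is the kind of candidate self-sustaining cascade that stitching is meant to surface. A real tool would still need to check that the conditions of the stitched propagations are compatible before reporting the cycle.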

Paper Structure

This paper contains 49 sections, 7 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: A real-world self-sustaining cascading failure from Amazon AWS. The Cell (Cluster) Manager manages a cluster of hosts (servers), each storing a set of data shards.
  • Figure 2: Model of fault injection experiments of cascading failures, characterized by different system states.
  • Figure 3: Overview of CSnake. Blue boxes are components of CSnake.
  • Figure 4: Pseudo-code demonstrating the injection and monitor points. "TP" means throw point, "NP" means negation point, and "MP" means monitor point (used and explained in the section on the approximate compatibility check).
  • Figure 5: Pseudo-code demonstrating delay in nested loops, code simplified from BPServiceActor in HDFS.
  • ...and 2 more figures