Root Cause Analysis for Microservice Systems via Cascaded Conditional Learning with Hypergraphs

Shuaiyu Xie, Hanbin He, Jian Wang, Bing Li

TL;DR

The paper tackles root cause analysis in microservice systems (MSS) through two core tasks, root cause localization (RCL) and failure type identification (FTI), and identifies limitations of parallel multitask learning and pairwise instance modeling. It introduces CCLH, a three-phase framework that combines multimodal feature extraction, a three-type heterogeneous hypergraph for group-influenced status fusion, and cascaded conditional learning that first localizes the culprit before diagnosing its failure type. Key contributions include (i) a three-level taxonomy of inter-instance group relations captured via UniGAT-HE on a heterogeneous hypergraph, (ii) a cascaded training paradigm that aligns with real-world diagnostic workflows, and (iii) extensive evaluation across three MSS datasets showing superior RCL and FTI performance and good generalization, with efficiency suitable for online use. The results demonstrate that modeling group influences and respecting the causal order between tasks significantly improve diagnostic accuracy and interpretability, offering practical benefits for SRE teams in cloud-native environments. Future work may extend CCLH to incorporate platform events and develop automatic parameter tuning for the task trigger to further enhance robustness.
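To make the "localize first, then diagnose" ordering concrete, the following is a minimal sketch of cascaded inference, not the CCLH model itself. The function name `cascaded_diagnosis`, the score dictionaries, and the instance and failure-type names are all hypothetical; the sketch only shows how the FTI step can be conditioned on the instance chosen by the RCL step.

```python
from typing import Dict, Tuple


def cascaded_diagnosis(
    culprit_scores: Dict[str, float],
    type_scores: Dict[str, Dict[str, float]],
) -> Tuple[str, str]:
    """Pick the culprit first, then pick a failure type for that culprit only.

    culprit_scores: hypothetical per-instance RCL scores (higher = more suspicious).
    type_scores:    hypothetical per-instance FTI scores, keyed by failure type.
    """
    # Stage 1 (RCL): choose the instance with the highest suspicion score.
    culprit = max(culprit_scores, key=culprit_scores.get)

    # Stage 2 (FTI): read only the failure-type scores of the chosen culprit,
    # so the second decision is conditioned on the first.
    scores_for_culprit = type_scores[culprit]
    failure_type = max(scores_for_culprit, key=scores_for_culprit.get)

    return culprit, failure_type
```

In a parallel multitask setup, by contrast, the failure type would be predicted without knowing which instance was blamed; here a wrong culprit in stage 1 changes which scores stage 2 ever looks at.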

Abstract

Root cause analysis in microservice systems typically involves two core tasks: root cause localization (RCL) and failure type identification (FTI). Despite substantial research efforts, conventional diagnostic approaches still face two key challenges. First, these methods predominantly adopt a joint learning paradigm for RCL and FTI to exploit shared information and reduce training time. However, this simplistic integration neglects the causal dependencies between tasks, thereby impeding inter-task collaboration and information transfer. Second, these existing methods primarily focus on point-to-point relationships between instances, overlooking the group nature of inter-instance influences induced by deployment configurations and load balancing. To overcome these limitations, we propose CCLH, a novel root cause analysis framework that orchestrates diagnostic tasks based on cascaded conditional learning. CCLH provides a three-level taxonomy for group influences between instances and incorporates a heterogeneous hypergraph to model these relationships, facilitating the simulation of failure propagation. Extensive experiments conducted on datasets from three microservice benchmarks demonstrate that CCLH outperforms state-of-the-art methods in both RCL and FTI.
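As an illustration of how group relations differ from point-to-point edges, here is a minimal sketch that builds hyperedges over instances and the corresponding incidence matrix. It assumes two illustrative grouping keys, `service` (instances behind the same load balancer) and `host` (instances co-deployed on the same machine), loosely motivated by the abstract; the paper's actual three-level taxonomy and hypergraph construction are not reproduced here, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Instance = Dict[str, str]          # e.g. {"name": "cart-0", "service": "cart", "host": "node-a"}
HyperedgeId = Tuple[str, str]      # (grouping key, shared value), e.g. ("host", "node-a")


def build_hyperedges(
    instances: Sequence[Instance],
    keys: Sequence[str] = ("service", "host"),
) -> Dict[HyperedgeId, List[str]]:
    """Group instances that share a value under any of the given keys."""
    groups: Dict[HyperedgeId, List[str]] = defaultdict(list)
    for inst in instances:
        for key in keys:
            groups[(key, inst[key])].append(inst["name"])
    # A hyperedge is only meaningful when it connects at least two instances.
    return {edge: members for edge, members in groups.items() if len(members) >= 2}


def incidence_matrix(
    names: Sequence[str],
    hyperedges: Dict[HyperedgeId, List[str]],
) -> Tuple[List[HyperedgeId], List[List[int]]]:
    """Return (ordered hyperedge ids, H) where H[i][j] = 1 if instance i is in hyperedge j."""
    row = {name: i for i, name in enumerate(names)}
    edge_ids = sorted(hyperedges)
    H = [[0] * len(edge_ids) for _ in names]
    for j, edge in enumerate(edge_ids):
        for member in hyperedges[edge]:
            H[row[member]][j] = 1
    return edge_ids, H
```

Unlike an ordinary graph edge, a single column of `H` may contain more than two ones, which is what lets one hyperedge represent an entire group of instances that influence each other.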

Paper Structure

This paper contains 28 sections, 13 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The group relationships between instances.
  • Figure 2: Part of call relationships in Online Boutique.
  • Figure 3: Overall architecture of CCLH.
  • Figure 4: Example of hypergraph construction. The left part of this figure illustrates the dependency graph of one MSS, capturing partial relationships between instances. The right part shows the corresponding hypergraph transformed from this dependency graph.
  • Figure 5: Performance comparison across different task triggers.
  • ...and 1 more figure