Root Cause Analysis for Microservice Systems via Cascaded Conditional Learning with Hypergraphs
Shuaiyu Xie, Hanbin He, Jian Wang, Bing Li
TL;DR
The paper tackles root cause analysis in microservice systems through its two core tasks, root cause localization (RCL) and failure type identification (FTI), and identifies limitations of parallel multitask learning and pairwise instance modeling. It introduces CCLH, a three-phase framework that combines multimodal feature extraction, a three-type heterogeneous hypergraph for group-influenced status fusion, and cascaded conditional learning that localizes the culprit first and then diagnoses its failure type. Key contributions include (i) a three-level taxonomy of inter-instance group relations captured via UniGAT-HE on a heterogeneous hypergraph, (ii) a cascaded training paradigm that aligns with real-world diagnostic workflows, and (iii) extensive evaluation across three microservice system datasets showing superior RCL and FTI performance and good generalization, with efficiency suitable for online use. The results demonstrate that modeling group influences and respecting the causal order between tasks significantly improves diagnostic accuracy and interpretability, offering practical benefits for SRE teams in cloud-native environments. Future work may extend CCLH to incorporate platform events and develop automatic parameter tuning for the task trigger to further enhance robustness.
Abstract
Root cause analysis in microservice systems typically involves two core tasks: root cause localization (RCL) and failure type identification (FTI). Despite substantial research efforts, conventional diagnostic approaches still face two key challenges. First, these methods predominantly adopt a joint learning paradigm for RCL and FTI to exploit shared information and reduce training time. However, this simplistic integration neglects the causal dependencies between the two tasks, thereby impeding inter-task collaboration and information transfer. Second, existing methods focus primarily on point-to-point relationships between instances, overlooking the group nature of inter-instance influences induced by deployment configurations and load balancing. To overcome these limitations, we propose CCLH, a novel root cause analysis framework that orchestrates diagnostic tasks based on cascaded conditional learning. CCLH provides a three-level taxonomy for group influences between instances and incorporates a heterogeneous hypergraph to model these relationships, facilitating the simulation of failure propagation. Extensive experiments conducted on datasets from three microservice benchmarks demonstrate that CCLH outperforms state-of-the-art methods in both RCL and FTI.
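The two ideas in the abstract, group relations expressed as hyperedges and a cascade that localizes before it identifies the failure type, can be illustrated with a toy sketch. This is not the CCLH implementation: the instance names, anomaly scores, symptom values, hyperedge types, and the mean-based fusion rule are all hypothetical, and the fusion stands in for the learned attention the paper describes.

```python
from collections import defaultdict

# Hypothetical per-instance anomaly scores (illustrative values only).
anomaly = {"cart-0": 0.9, "cart-1": 0.7, "pay-0": 0.2, "pay-1": 0.1}

# Hypothetical hyperedges: each one groups ALL instances that share a relation,
# instead of linking them pair by pair. The edge types are assumptions.
hyperedges = {
    ("same_service", "cart"): ["cart-0", "cart-1"],
    ("same_service", "pay"): ["pay-0", "pay-1"],
    ("same_node", "node-a"): ["cart-0", "pay-0"],
    ("same_node", "node-b"): ["cart-1", "pay-1"],
}

# Hypothetical per-instance symptom strengths used for failure typing.
symptoms = {
    "cart-0": {"cpu": 0.8, "memory": 0.3, "network": 0.1},
    "cart-1": {"cpu": 0.2, "memory": 0.1, "network": 0.6},
    "pay-0": {"cpu": 0.1, "memory": 0.2, "network": 0.1},
    "pay-1": {"cpu": 0.1, "memory": 0.1, "network": 0.1},
}


def fuse(scores, edges):
    """One round of node -> hyperedge -> node mean aggregation."""
    # Summarize each group by the mean score of its members.
    edge_mean = {e: sum(scores[v] for v in m) / len(m) for e, m in edges.items()}
    # Collect, for every instance, the summaries of the groups it belongs to.
    incoming = defaultdict(list)
    for e, members in edges.items():
        for v in members:
            incoming[v].append(edge_mean[e])
    fused = {}
    for v, own in scores.items():
        msgs = incoming.get(v)
        # Blend the instance's own score with its group context, if any.
        fused[v] = 0.5 * own + 0.5 * sum(msgs) / len(msgs) if msgs else own
    return fused


def diagnose(scores, edges, symptom_table):
    """Cascade: localize the culprit first, then type the failure for it only."""
    fused = fuse(scores, edges)
    culprit = max(fused, key=fused.get)                # stage 1: localization
    culprit_symptoms = symptom_table[culprit]          # condition on stage 1
    failure_type = max(culprit_symptoms, key=culprit_symptoms.get)  # stage 2
    return culprit, failure_type


result = diagnose(anomaly, hyperedges, symptoms)
```

With these toy values, `result` is `("cart-0", "cpu")`. The second stage reads only the symptoms of the instance chosen by the first, which is the dependency a parallel multitask setup would not enforce.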
