MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge

Shuai Liang, Pengfei Chen, Bozhe Tian, Gou Tan, Maohong Xu, Youjun Qu, Yahui Zhao, Yiduo Shang, Chongkang Tan

TL;DR

MetaRCA shows robust cross-system generalization, maintaining over 80% accuracy across diverse systems, and surpasses the strongest baseline by 29 percentage points in service-level and 48 percentage points in metric-level accuracy.

Abstract

The dynamics and complexity of cloud-native systems present significant challenges for Root Cause Analysis (RCA). While causality-based RCA methods have shown significant progress in recent years, their practical adoption is fundamentally limited by three intertwined challenges: poor scalability against system complexity, brittle generalization across different system topologies, and inadequate integration of domain knowledge. These limitations create a vicious cycle, hindering the development of robust and efficient RCA solutions. This paper introduces MetaRCA, a generalizable RCA framework for cloud-native systems. MetaRCA first constructs a Meta Causal Graph (MCG) offline, a reusable knowledge base defined at the metadata level. To build the MCG, we propose an evidence-driven algorithm that systematically fuses knowledge from Large Language Models (LLMs), historical fault reports, and observability data. When a fault occurs, MetaRCA performs a lightweight online inference by dynamically instantiating the MCG into a localized graph based on the current context, and then leverages real-time data to weight and prune causal links for precise root cause localization. Evaluated on 252 public and 59 production failures, MetaRCA demonstrates state-of-the-art performance. It surpasses the strongest baseline by 29 percentage points in service-level and 48 percentage points in metric-level accuracy. This performance advantage widens as system complexity increases, with its overhead scaling near-linearly. Crucially, MetaRCA shows robust cross-system generalization, maintaining over 80% accuracy across diverse systems.
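
The online step described above (instantiate the MCG for the current context, then weight and prune links with real-time data) can be pictured with a minimal sketch. Everything in it is an illustrative assumption and not the paper's algorithm: the meta-level edge format, the helper names `instantiate` and `weight_and_prune`, the choice of absolute Pearson correlation as the edge weight, the 0.5 threshold, and the sample metric values are all hypothetical.

```python
import math

# Hypothetical meta-level causal edges: (cause metric type, effect metric type).
META_EDGES = [("cpu_usage", "latency"), ("latency", "error_rate")]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def instantiate(meta_edges, services):
    """Expand each meta-level edge into a concrete edge per affected service."""
    return [
        (f"{svc}.{cause}", f"{svc}.{effect}")
        for svc in services
        for cause, effect in meta_edges
    ]


def weight_and_prune(edges, series, threshold=0.5):
    """Weight each edge by |correlation| of its endpoints; drop weak edges."""
    kept = {}
    for cause, effect in edges:
        if cause not in series or effect not in series:
            continue  # no observability data for this instance
        weight = abs(pearson(series[cause], series[effect]))
        if weight >= threshold:
            kept[(cause, effect)] = weight
    return kept


# Made-up metric samples for a single hypothetical service named "cart".
SERIES = {
    "cart.cpu_usage": [1, 2, 3, 4, 5],
    "cart.latency": [2, 4, 6, 8, 10],
    "cart.error_rate": [5, 1, 4, 2, 3],
}

local_edges = instantiate(META_EDGES, ["cart"])
localized_graph = weight_and_prune(local_edges, SERIES)
print(localized_graph)
```

In this toy setup the meta-level knowledge is reused unchanged for any service name passed to `instantiate`; only the weighting step looks at live data, which is the separation between offline knowledge and online inference that the abstract describes.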

Paper Structure (36 sections, 16 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Shifting from system-specific RCA to reusable causal patterns.
  • Figure 2: The framework of MetaRCA.
  • Figure 3: Performance comparison of different RCA methods as the number of microservice nodes increases. (a) Efficiency, measured by execution time. (b) Service-level localization accuracy, measured by AC@3.
  • Figure 4: Distribution of localization accuracy (AC@3) for different RCA methods across seven systems (three open-source and four production). The box plots show the performance distribution at (a) the service level and (b) the metric level.
  • Figure 5: Localization accuracy (AC@3) of different causal graph construction and ranking method combinations. The performance is shown on (a) the RE2-TT dataset and (b) the Production dataset, with results for both service-level (total height) and metric-level (solid portion) accuracy.
  • ...and 3 more figures
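
The captions above report accuracy as AC@3, which this excerpt does not define. Assuming the definition common in the RCA literature (the share of failure cases whose true root cause appears among the top 3 ranked candidates) and a single root cause per case, it could be computed as in the sketch below. The function name `ac_at_k` and the sample rankings are hypothetical.

```python
def ac_at_k(rankings, true_causes, k=3):
    """Fraction of cases whose true root cause is within the top-k candidates.

    Assumes exactly one true root cause per failure case.
    """
    if not true_causes:
        raise ValueError("at least one failure case is required")
    hits = sum(
        1
        for ranking, truth in zip(rankings, true_causes)
        if truth in ranking[:k]
    )
    return hits / len(true_causes)


# Made-up example: four failure cases with ranked candidate services.
RANKINGS = [
    ["cart", "payment", "search", "auth"],
    ["auth", "cart", "search", "payment"],
    ["search", "auth", "cart", "payment"],
    ["payment", "search", "auth", "cart"],
]
TRUE_CAUSES = ["cart", "search", "payment", "payment"]

score = ac_at_k(RANKINGS, TRUE_CAUSES, k=3)
print(score)
```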