RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic

Andrea Tonon, Meng Zhang, Bora Caglayan, Fei Shen, Tong Gui, MingXue Wang, Rong Zhou

TL;DR

RADICE addresses root-cause analysis for system performance by distinguishing causation from correlation in time-series monitoring data. It introduces a causal domain knowledge model that accepts partial expert input, and a four-phase pipeline (discovery, enhancement, refinement, subtraction) that outputs a root-cause causal sub-graph $\mathbf{G_{RC}}$ linking the performance metric $X_t$ to candidate root causes. The method augments data-driven causal discovery (PCMCI+) with an entropy-based orientation and graph refinement to reduce spurious edges, while the domain knowledge guides pruning. Experiments on simulated data and a real advertising use case show that RADICE outperforms correlation-based baselines and standard causal discovery methods, producing actionable root-cause sub-graphs and interpretable time-series plots.

Abstract

Root cause analysis is one of the most crucial operations in software reliability regarding system performance diagnostic. It aims to identify the root causes of system performance anomalies, allowing the resolution or the future prevention of issues that can cause millions of dollars in losses. Common existing approaches that rely on data correlation or full domain expert knowledge are inaccurate or infeasible in most industrial cases, since correlation does not imply causation, and domain experts may not have full knowledge of complex, real-time systems. In this work, we define a novel causal domain knowledge model representing causal relations about the underlying system components to allow domain experts to contribute partial domain knowledge for root cause analysis. We then introduce RADICE, an algorithm that, through causal graph discovery, enhancement, refinement, and subtraction processes, outputs a root cause causal sub-graph showing the causal relations between the system components affected by the anomaly. We evaluated RADICE with simulated data and reported a real data use case, sharing the lessons we learned. The experiments show that RADICE provides better results than other baseline methods, including causal discovery algorithms and correlation-based approaches for root cause analysis.

Paper Structure

This paper contains 16 sections, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: Example of a whole causal graph of a system (left) and of a root cause causal sub-graph associated with an anomaly in the same system (right). Nodes represent system component metrics and performance metrics, and edges represent causal relations between them. $X7$ is the performance metric of the system. $X2$ is the component that caused the anomaly (the root cause of the anomaly). $X3$, $X4$, $X5$, $X6$, $X7$, and $X8$ are all components affected by the anomaly: while $X4$ and $X5$ are intermediate components between the root cause and the performance metric, $X3$, $X6$, and $X8$ do not have a causal impact on the performance metric.
  • Figure 2: Schema of RADICE.
  • Figure 3: Results for $minSim$. It shows the recall and precision of RADICE w/o DK and RADICE$(L,50E)$ on simulated data (with $N=10$ and $N=15$ nodes) as $minSim$ varies.
  • Figure 4: Results with advertising data. It shows the causal sub-graph obtained with each algorithm for case 1 and case 2. Each metric is represented by a different color. For RADICE, $exposure\_rate$ and the root causes are represented by continuous lines, while intermediate components are represented by dashed lines. For CoFlux, PCMCI+, and TCDF, only the portion connected to $exposure\_rate$ is reported.
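
The Figure 1 caption distinguishes root causes, intermediate components, and affected components with no causal impact on the performance metric. This can be illustrated with a small sketch. It is not the RADICE algorithm: it assumes the full causal graph and the set of anomaly-affected nodes are already known. The edge list is hypothetical, chosen only to agree with the roles named in the caption, because the figure's actual edges are not given in the text. The function names `ancestors` and `root_cause_subgraph` are also illustrative.

```python
from collections import deque

def ancestors(edges, target):
    """Return every node that has a directed path to `target`."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def root_cause_subgraph(edges, affected, metric):
    """Keep affected nodes that causally reach `metric`, then split them
    into root causes (no kept parent) and intermediate components."""
    keep = (ancestors(edges, metric) & set(affected)) | {metric}
    sub_edges = [(s, d) for s, d in edges if s in keep and d in keep]
    has_parent = {d for _, d in sub_edges}
    roots = {n for n in keep if n != metric and n not in has_parent}
    intermediates = keep - roots - {metric}
    return sub_edges, roots, intermediates

# Hypothetical edges (an assumption, not taken from the paper's figure).
EDGES = [
    ("X1", "X5"), ("X2", "X3"), ("X2", "X4"), ("X4", "X5"),
    ("X4", "X6"), ("X5", "X7"), ("X7", "X8"),
]
# Affected components as listed in the Figure 1 caption, plus the root cause X2.
AFFECTED = {"X2", "X3", "X4", "X5", "X6", "X7", "X8"}

sub_edges, roots, intermediates = root_cause_subgraph(EDGES, AFFECTED, "X7")
print(sorted(sub_edges))
print(sorted(roots), sorted(intermediates))
```

With these assumed edges, $X3$, $X6$, and $X8$ drop out because they have no directed path to $X7$, and $X1$ drops out because it is not affected by the anomaly. One simplification to note: an affected ancestor whose only path to the metric runs through an unaffected node would be kept but left disconnected, which a fuller treatment would need to handle.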