RADICE: Causal Graph Based Root Cause Analysis for System Performance Diagnostic
Andrea Tonon, Meng Zhang, Bora Caglayan, Fei Shen, Tong Gui, MingXue Wang, Rong Zhou
TL;DR
RADICE addresses root cause analysis for system performance by distinguishing causation from correlation in time-series monitoring data. It introduces a causal domain knowledge model that accepts partial expert input, and a four-phase pipeline (discovery, enhancement, refinement, subtraction) that outputs a root cause causal sub-graph $\mathbf{G_{RC}}$ linking the performance metric $X_t$ to candidate root causes. The method augments data-driven causal discovery (PCMCI+) with entropy-based orientation and graph refinement to reduce spurious edges, while the domain knowledge guides pruning. Experiments on simulated data and a real advertising use case show that RADICE outperforms correlation-based baselines and standard causal discovery methods, producing actionable root cause sub-graphs and interpretable time-series plots.
Abstract
Root cause analysis is one of the most crucial operations in software reliability for system performance diagnostics. It aims to identify the root causes of system performance anomalies, allowing the resolution or future prevention of issues that can cause millions of dollars in losses. Common existing approaches rely on either data correlation or full domain expert knowledge. They are inaccurate or infeasible in most industrial cases, since correlation does not imply causation and domain experts may not have full knowledge of complex, real-time systems. In this work, we define a novel causal domain knowledge model that represents causal relations between the underlying system components, allowing domain experts to contribute partial domain knowledge for root cause analysis. We then introduce RADICE, an algorithm that, through causal graph discovery, enhancement, refinement, and subtraction, outputs a root cause causal sub-graph showing the causal relations between the system components affected by the anomaly. We evaluated RADICE on simulated data and report a real data use case, sharing the lessons we learned. The experiments show that RADICE provides better results than baseline methods for root cause analysis, including causal discovery algorithms and correlation-based approaches.
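To make the idea of a root cause causal sub-graph concrete, the following is a minimal sketch under simplifying assumptions; it is not the RADICE implementation. It assumes a causal graph is already available as a list of directed `(cause, effect)` edges, that the set of anomalous components is known, and that time lags can be ignored. The function name `root_cause_subgraph` and the metric names are hypothetical, chosen only for illustration. The sketch walks backwards from the performance metric, keeps only edges whose cause is anomalous, and reports the kept nodes with no anomalous cause of their own as candidate root causes.

```python
from collections import defaultdict, deque


def root_cause_subgraph(edges, target, anomalous):
    """Illustrative only: keep causal paths into `target` made of anomalous nodes.

    edges     -- iterable of (cause, effect) pairs (assumed already discovered)
    target    -- name of the performance metric showing the anomaly
    anomalous -- set of component names flagged as anomalous
    Returns (kept_edges, candidate_root_causes).
    """
    # Index the graph by effect so we can walk from the target to its causes.
    parents = defaultdict(set)
    for cause, effect in edges:
        parents[effect].add(cause)

    kept_nodes = {target}
    kept_edges = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for cause in parents[node]:
            if cause not in anomalous:
                continue  # drop causes that behave normally
            kept_edges.add((cause, node))
            if cause not in kept_nodes:
                kept_nodes.add(cause)
                queue.append(cause)

    # Candidate root causes: kept nodes with no incoming kept edge.
    has_kept_cause = {effect for _, effect in kept_edges}
    roots = {n for n in kept_nodes if n != target and n not in has_kept_cause}
    return kept_edges, roots


# Hypothetical monitoring metrics, not taken from the paper.
edges = [
    ("disk_io", "db_latency"),
    ("cache_hits", "db_latency"),
    ("db_latency", "response_time"),
    ("cpu_load", "response_time"),
]
sub_edges, roots = root_cause_subgraph(
    edges, target="response_time", anomalous={"disk_io", "db_latency"}
)
print(sorted(sub_edges))  # [('db_latency', 'response_time'), ('disk_io', 'db_latency')]
print(sorted(roots))      # ['disk_io']
```

In this toy example, `cpu_load` and `cache_hits` are causes of the metric in the full graph but are dropped because they are not anomalous, which leaves `disk_io` as the only candidate root cause.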
