On the Fly Detection of Root Causes from Observed Data with Application to IT Systems
Lei Zan, Charles K. Assaad, Emilie Devijver, Eric Gaussier, Ali Aït-Bachir
TL;DR
This work tackles root-cause analysis in threshold-based IT monitoring by transforming continuous metrics into binary thresholded signals $X \geq r_x$ and modeling their event-driven propagation with a threshold-based dynamic causal framework. It introduces a threshold-based full-time causal graph (T-FTCG), a threshold-based summary graph (T-SCG), and a threshold-based dynamic structural causal model (T-DSCM), together with the T-RCA algorithm to identify true root causes from online anomalies. An agent-based extension (T-RCA-agent) further relaxes the one-intervention assumption, ensuring robust root-cause detection when interactions among root causes occur. Across synthetic and real IT datasets, the proposed methods achieve superior accuracy (notably higher F1-scores) and demonstrate practical utility for rapid incident mitigation in IT systems.
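The thresholding step mentioned above can be illustrated with a minimal sketch. It maps each continuous metric $X$ to the indicator of $X \geq r_x$. The function name `binarize`, the metric names, and the threshold values are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List


def binarize(
    metrics: Dict[str, List[float]], thresholds: Dict[str, float]
) -> Dict[str, List[int]]:
    """Map each continuous series X to the indicator series 1[X >= r_x]."""
    return {
        name: [1 if value >= thresholds[name] else 0 for value in series]
        for name, series in metrics.items()
    }


# Hypothetical metric names and thresholds, for illustration only.
metrics = {"cpu_load": [0.2, 0.95, 0.7], "latency_ms": [120.0, 480.0, 510.0]}
thresholds = {"cpu_load": 0.9, "latency_ms": 500.0}
binary = binarize(metrics, thresholds)
```

Each metric keeps its own threshold $r_x$, so the binary signal for one metric does not depend on the scale of another.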
Abstract
This paper introduces a new structural causal model tailored to threshold-based IT systems and presents a new algorithm designed to rapidly detect the root causes of anomalies in such systems. When root causes are not causally related, the method is proven to be correct; an extension based on the intervention of an agent is proposed to relax this assumption. Our algorithm and its agent-based extension rely on causal discovery from offline data and on subgraph traversal when new anomalies are encountered in online data. Our extensive experiments demonstrate the superior performance of our methods, even when applied to data generated from alternative structural causal models or to real IT monitoring data.
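To make the idea of subgraph traversal concrete, the following is a simplified sketch and not the paper's T-RCA algorithm, whose details the abstract does not give. It assumes a causal graph already learned offline and stored as a parent map, assumes the graph is acyclic apart from self-loops, and flags as candidate root causes the anomalous nodes that have no other anomalous parent. The function name `candidate_root_causes` and the node names are hypothetical.

```python
from typing import Dict, Set


def candidate_root_causes(
    parents: Dict[str, Set[str]], anomalous: Set[str]
) -> Set[str]:
    """Return anomalous nodes with no anomalous parent other than themselves."""
    roots = set()
    for node in anomalous:
        # Restrict the parent set to the anomalous subgraph and ignore self-loops.
        anomalous_parents = (parents.get(node, set()) & anomalous) - {node}
        if not anomalous_parents:
            roots.add(node)
    return roots


# Hypothetical graph: db -> api -> frontend, with a self-loop on db.
parents = {"db": {"db"}, "api": {"db"}, "frontend": {"api"}}
anomalous = {"db", "api", "frontend"}
roots = candidate_root_causes(parents, anomalous)
```

In this toy case only `db` is returned, because the anomalies on `api` and `frontend` can each be explained by an anomalous parent. A heuristic of this kind is only sensible when root causes are not causally related to one another, which is the setting in which the abstract states the method is proven correct.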
