On the Fly Detection of Root Causes from Observed Data with Application to IT Systems

Lei Zan, Charles K. Assaad, Emilie Devijver, Eric Gaussier, Ali Aït-Bachir

TL;DR

This work tackles root-cause analysis in threshold-based IT monitoring by transforming continuous metrics into binary thresholded signals $X \geq r_x$ and modeling their event-driven propagation with a threshold-based dynamic causal framework. It introduces a threshold-based full-time causal graph (T-FTCG), a threshold-based summary graph (T-SCG), and a threshold-based dynamic structural causal model (T-DSCM), together with the T-RCA algorithm to identify true root causes from online anomalies. An agent-based extension (T-RCA-agent) further relaxes the one-intervention assumption, ensuring robust root-cause detection when interactions among root causes occur. Across synthetic and real IT datasets, the proposed methods achieve superior accuracy (notably higher F1-scores) and demonstrate practical utility for rapid incident mitigation in IT systems.
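The thresholding step can be illustrated with a minimal sketch. It assumes a single fixed threshold per metric and takes the appearance time of an anomaly to be the first index at which the thresholded signal equals 1; the function names, the sample values, and the threshold 0.9 are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Sequence


def binarize(series: Sequence[float], threshold: float) -> List[int]:
    """Map each value x to 1 if x >= threshold, else 0."""
    return [1 if x >= threshold else 0 for x in series]


def appearance_time(binary: Sequence[int]) -> Optional[int]:
    """Return the index of the first 1 in a thresholded series, or None.

    Assumption: the appearance time is the first time step at which
    the metric reaches its threshold.
    """
    for t, flag in enumerate(binary):
        if flag == 1:
            return t
    return None


# Hypothetical CPU-load readings with an assumed threshold of 0.9.
cpu_load = [0.42, 0.55, 0.91, 0.97, 0.60]
cpu_binary = binarize(cpu_load, threshold=0.9)
cpu_first_anomaly = appearance_time(cpu_binary)
```

Here `cpu_binary` marks the two readings that reach the threshold, and `cpu_first_anomaly` is the index of the first of them.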

Abstract

This paper introduces a new structural causal model tailored for representing threshold-based IT systems and presents a new algorithm designed to rapidly detect root causes of anomalies in such systems. When root causes are not causally related, the method is proven to be correct, and an extension based on the intervention of an agent is proposed to relax this assumption. Our algorithm and its agent-based extension leverage causal discovery from offline data and engage in subgraph traversal when encountering new anomalies in online data. Our extensive experiments demonstrate the superior performance of our methods, even when applied to data generated from alternative structural causal models or real IT monitoring data.
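The subgraph traversal mentioned above can be sketched as follows. This is not the paper's T-RCA algorithm: it is a simplified illustration that assumes two rules suggested by the Figure 2 caption (strongly connected components and appearance times). Within the subgraph restricted to anomalous vertices, a strongly connected component with no anomalous parent outside it is treated as a source, and its earliest-appearing vertex is reported as a candidate root cause. The function names, the graph encoding, and the example graph are all hypothetical.

```python
from typing import Dict, Set


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Return all vertices reachable from start (including start)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def candidate_root_causes(
    graph: Dict[str, Set[str]],
    anomalous: Set[str],
    first_seen: Dict[str, int],
) -> Set[str]:
    """Illustrative heuristic, not the published algorithm.

    graph maps each vertex to its children in a summary graph;
    first_seen maps each anomalous vertex to its appearance time.
    """
    # Keep only edges between anomalous vertices.
    sub = {v: {c for c in graph.get(v, ()) if c in anomalous} for v in anomalous}
    reach = {v: reachable(sub, v) for v in anomalous}
    # Two vertices share a component when each can reach the other.
    sccs = {frozenset(u for u in reach[v] if v in reach[u]) for v in anomalous}
    roots: Set[str] = set()
    for scc in sccs:
        # Skip components that have an anomalous parent outside themselves.
        has_external_parent = any(
            child in scc for v in anomalous - scc for child in sub[v]
        )
        if has_external_parent:
            continue
        earliest = min(first_seen[v] for v in scc)
        roots |= {v for v in scc if first_seen[v] == earliest}
    return roots


# Hypothetical summary graph: A -> B, B <-> C, D -> C.
example_graph = {"A": {"B"}, "B": {"C"}, "C": {"B"}, "D": {"C"}}
cycle_roots = candidate_root_causes(example_graph, {"B", "C"}, {"B": 3, "C": 5})
chain_roots = candidate_root_causes(
    example_graph, {"A", "B", "C"}, {"A": 1, "B": 3, "C": 5}
)
```

In the first call only B and C are anomalous and form a cycle, so the earlier of the two is returned. In the second call A is also anomalous and is a parent of the cycle, so only A is returned. The pairwise reachability test is quadratic in the number of anomalous vertices, which is acceptable for a sketch but would be replaced by a linear-time SCC algorithm on larger graphs.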

Paper Structure (30 sections, 5 theorems, 3 equations, 12 figures, 1 algorithm)

This paper contains 30 sections, 5 theorems, 3 equations, 12 figures, 1 algorithm.

Key Result

Lemma 1

Let $\mathcal{M}$ be a T-DSCM associated with a T-FTCG $\mathcal{G}_{\text{ft}}$. If the assumptions of the causal Markov condition, adjacency faithfulness, no hidden confounding, and consistency are satisfied, then $\mathcal{G}_{\text{ft}}$ is identifiable from the distribution induced by $\mathcal{M}$...

Figures (12)

  • Figure 1: Example. Illustration of (a) a T-FTCG, (b) a T-SCG and (c) the mapping for the appearance time of anomalies on a system with four variables.
  • Figure 2: Overview of T-RCA. First, a T-SCG is learned from the offline dataset $\mathcal{D}_{\text{off}}$. Second, the anomalous T-SCG and the appearance times of anomalies are deduced from the online dataset. Finally, root causes are detected using Lemmas \ref{lemma:root_causes_forwards} and \ref{lemma:root_causes_SCC}.
  • Figure 3: Average F1-score and its variance across 50 simulations on simulated data. The length of $\mathcal{D}_{\text{on}}$ ranges from 10 to 200, with data generated from a T-DSCM (a and d), a DSCM with root causes experiencing changes in causal coefficients (b and e), and a DSCM with root causes undergoing changes in noise (c and f). Each model is assessed under two settings: one adhering to Assumption \ref{assumption:one_intervention} (a, b, and c) and another that violates it (d, e, and f).
  • Figure 4: Real IT monitoring data: (a) the SCG provided by the experts, on the normal regime, where root causes correspond to the vertices with thick borders (PMDB and ESB); (b) the T-SCG learned by T-RCA, where inferred root causes correspond to purple vertices (PMDB and ESB); (c) F1-score for the IT monitoring data, varying the lengths of $\mathcal{D}_{\text{on}}$ from 10 to 100.
  • Figure 5: (a) demonstrates a scenario where time and dependence fail to detect all root causes. (b) demonstrates a scenario where time and conditional dependence on a single variable fail to detect all root causes.
  • ...and 7 more figures

Theorems & Definitions (19)

  • Definition 1: Time series
  • Definition 2: Binary thresholding of time series
  • Definition 3: Threshold-based full-time causal graph, T-FTCG
  • Definition 4: Threshold-based dynamic structural causal model, T-DSCM
  • Definition 5: Anomaly and root cause
  • Definition 6: Threshold-based Summary Causal Graph, T-SCG
  • Definition 7: Appearance time of anomalies
  • Lemma 1
  • Definition 8: Strongly connected component, SCC
  • Lemma 2
  • ...and 9 more