Root Cause Analysis of Outliers with Missing Structural Knowledge

William Roy Orchard, Nastaran Okati, Sergio Hernan Garrido Mejia, Patrick Blöbaum, Dominik Janzing

TL;DR

This work tackles root-cause analysis (RCA) when only a single anomalous sample is available and the causal graph is a polytree that may or may not be known. It uses information-theoretic anomaly scores to avoid estimating conditional probabilities, proves that marginal scores suffice for causal reasoning in polytrees, and gives guarantees for two algorithms: SMOOTH TRAVERSAL (known graph) and SCORE ORDERING (unknown graph). The approach yields non-parametric p-value bounds and top-k guarantees, with competitive results on synthetic data and real cloud-services datasets, and the authors note limitations when moving beyond polytrees. The results address a key bottleneck in RCA: identifying root causes with minimal data and minimal structural assumptions.

Abstract

The goal of Root Cause Analysis (RCA) is to explain why an anomaly occurred by identifying where the fault originated. Several recent works model the anomalous event as resulting from a change in the causal mechanism at the root cause, i.e., as a soft intervention. RCA is then the task of identifying which causal mechanism changed. In real-world applications, one often has either few or only a single sample from the post-intervention distribution: a severe limitation for most methods, which assume one knows or can estimate the distribution. However, even those that do not are statistically ill-posed due to the need to probe regression models in regions of low probability density. In this paper, we propose simple, efficient methods to overcome both difficulties in the case where there is a single root cause and the causal graph is a polytree. When one knows the causal graph, we give guarantees for a traversal algorithm that requires only marginal anomaly scores and does not depend on specifying an arbitrary anomaly score cut-off. When one does not know the causal graph, we show that the heuristic of identifying root causes as the variables with the highest marginal anomaly scores is causally justified. To this end, we prove that anomalies with small scores are unlikely to cause those with larger scores in polytrees and give upper bounds for the likelihood of causal pathways with non-monotonic anomaly scores.
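
The unknown-graph heuristic described above, ranking variables by their marginal anomaly scores, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score is the negative log of an empirical tail probability, uses absolute deviation from the reference mean as the feature, and the metric names (`db_latency`, `api_latency`, `cpu_load`) and values are hypothetical.

```python
import math
from statistics import mean


def marginal_score(x, reference):
    """Empirical marginal anomaly score of x against normal-regime samples.

    Assumption: the score is -log of the estimated probability that a
    normal-regime sample deviates from the reference mean at least as much as x.
    """
    mu = mean(reference)
    t = abs(x - mu)
    exceed = sum(1 for r in reference if abs(r - mu) >= t)
    # Add-one smoothing keeps the estimated tail probability strictly positive.
    return -math.log((exceed + 1) / (len(reference) + 1))


def rank_root_cause_candidates(anomalous, normal, k=1):
    """Return the k variable names with the highest marginal anomaly scores."""
    scores = {name: marginal_score(anomalous[name], normal[name]) for name in anomalous}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical normal-regime observations for three metrics.
normal = {
    "db_latency": [10, 11, 9, 10, 12, 10, 11, 9],
    "api_latency": [50, 52, 49, 51, 50, 48, 53, 50],
    "cpu_load": [0.30, 0.35, 0.32, 0.31, 0.33, 0.34, 0.30, 0.32],
}
# A single anomalous sample, matching the single-sample setting of the paper.
anomalous = {"db_latency": 40, "api_latency": 53, "cpu_load": 0.33}

top = rank_root_cause_candidates(anomalous, normal, k=2)
print(top)
```

Only marginal distributions from the normal regime are needed here; no conditional distributions or regression models are fitted, which is the property the abstract emphasizes.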

Paper Structure (43 sections, 16 theorems, 55 equations, 10 figures, 10 tables, 2 algorithms)

Key Result

Lemma 3.2

$H^X_0$ can be rejected at level $p \leq e^{-S(x)}$.
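
A small numerical sketch of this bound follows. It rests on assumptions not shown on this page: that $S(x) = -\log P(\tau(X) \geq \tau(x))$ for some feature function $\tau$, and that $H^X_0$ is the null hypothesis that $x$ was drawn from the normal-regime distribution of $X$. The helper names and the reference data are hypothetical.

```python
import math


def it_anomaly_score(x, reference, tau=abs):
    """Empirical estimate of S(x) = -log P(tau(X) >= tau(x)).

    `reference` holds samples of X from the normal regime; `tau` is the
    feature function (absolute value here, an illustrative choice).
    """
    t = tau(x)
    exceed = sum(1 for r in reference if tau(r) >= t)
    # Add-one smoothing keeps the estimated tail probability strictly positive.
    tail = (exceed + 1) / (len(reference) + 1)
    return -math.log(tail)


def p_value_bound(score):
    """Upper bound e^{-S(x)} on the level at which the null can be rejected."""
    return math.exp(-score)


# Hypothetical reference sample: 101 evenly spaced values from -5.0 to 5.0.
reference = [0.1 * i for i in range(-50, 51)]

score = it_anomaly_score(4.95, reference)
bound = p_value_bound(score)
print(f"S(x) = {score:.3f}, p <= {bound:.4f}")
```

With this estimator the bound equals the smoothed empirical tail probability, so a higher score translates directly into a smaller p-value without any parametric model of $X$.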

Figures (10)

  • Figure 1: True positive rate for identifying the root cause against anomaly strength injected at the root cause.
  • Figure 3: Runtimes of the algorithms for the experiment in Fig. \ref{fig:anomaly_vs_accuracy}, that is, an SCM with 50 nodes (left), one with 100 nodes (center), and one with 1k nodes (right) (refer to the generation process above). The boxplots are produced using the default implementation in Matplotlib (Hunter, 2007). Note the log scale on the vertical axis.
  • Figure 4: True positive rate for identifying the root cause against anomaly strength injected at the root cause, including RCD and $\varepsilon$-Diagnosis. Note that both RCD and $\varepsilon$-Diagnosis are given 100 samples from the anomalous period, while all other methods are given only one.
  • Figure 5: True positive rate for identifying the root cause against anomaly strength injected at the root cause, when all structural equations are linear.
  • Figure 6: True positive rate for identifying the root cause against anomaly strength injected at the root cause, when all causal graphs are polytrees. Note that RCD and $\varepsilon$-Diagnosis are given 100 samples from the anomalous period, while all other algorithms are given only one.
  • ...and 5 more figures

Theorems & Definitions (19)

  • Example 1: Linear cause-effect model
  • Definition 3.1: Bivariate score typicality
  • Lemma 3.2: $p$-value bound for marginal anomaly event.
  • Lemma 3.3: Anomalies rarely cause larger anomalies
  • Lemma 3.4: Score typicality probably holds approximately
  • Lemma 3.5: Injectivity and monotonicity imply score typicality
  • Lemma 3.6: Bound on $p$-value for independent anomalies
  • Lemma 3.7: Conditional anomaly scores are independent
  • Lemma 3.8: The joint score is an IT anomaly score
  • Theorem 3.9: $p$-value for joint anomaly event
  • ...and 9 more