Root Cause Analysis of Outliers in Unknown Cyclic Graphs

Daniela Schkoda, Dominik Janzing

TL;DR

We address root cause analysis in unknown cyclic causal graphs under linear SEMs by leveraging an invariant precision-matrix transformation. The key idea is to apply the precision matrix $\Theta_{XX}$ to the anomalous observation $\tilde{x}$, which reveals the root causes and their cycle-parents and yields a shortlist that remains valid even when the graph is unknown. The framework extends to latent variables via projection and zig-zag latent paths, and is implemented with false discovery rate control (FDRC) using e-values for robust selection. Empirical results on simulations and real cloud data (e.g., PetShop) show improved accuracy and scalability over prior DAG-based approaches, indicating practical applicability to RCA in microservice systems with cycles.

Abstract

We study the propagation of outliers in cyclic causal graphs with linear structural equations, tracing them back to one or several "root cause" nodes. We show that it is possible to identify a short list of potential root causes provided that the perturbation is sufficiently strong and propagates according to the same structural equations as in the normal mode. This shortlist consists of the true root causes together with those of their parents that lie on a cycle with them. Notably, our method does not require prior knowledge of the causal graph.
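
To illustrate the idea of applying the precision matrix to an anomalous observation, here is a minimal sketch; it is not the authors' implementation. It assumes a hypothetical 4-node graph `A`, independent unit-variance noise, and the plain inverse of the sample covariance as the precision estimate. All variable names are illustrative. In this toy graph the root cause's only parent lies on a cycle with it, so the two largest entries of $|\hat{\Theta}\tilde{x}|$ are expected at the root cause and that parent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph; A[i, j] is the direct effect of node j on node i.
# Nodes 0 -> 1 -> 2 -> 0 form a cycle, and node 3 is a child of node 2.
A = np.zeros((4, 4))
A[1, 0] = A[2, 1] = A[0, 2] = 0.5
A[3, 2] = 0.8

p = A.shape[0]
mix = np.linalg.inv(np.eye(p) - A)  # x = A x + n  implies  x = (I - A)^{-1} n

# Normal-mode samples with independent unit-variance noise (assumption of this sketch).
X = rng.standard_normal((20_000, p)) @ mix.T

# Estimate the precision matrix from normal-mode data only; A is not used here.
theta_hat = np.linalg.inv(np.cov(X, rowvar=False))

# Anomalous observation: a strong perturbation is added to the noise of node 1,
# and it propagates through the same structural equations.
root = 1
n_anom = np.array([0.3, -0.2, 0.5, -0.4])
n_anom[root] += 20.0
x_anom = mix @ n_anom

raw = np.abs(x_anom)                 # raw magnitude of the anomalous observation
scores = np.abs(theta_hat @ x_anom)  # magnitude after the precision-matrix transformation

shortlist = set(np.argsort(scores)[-2:].tolist())
print("raw magnitudes:", np.round(raw, 2))
print("transformed scores:", np.round(scores, 2))
print("two highest-scoring nodes:", sorted(shortlist))
```

The graph `A` is used only to simulate data; the scoring step sees only `X` and `x_anom`. In this setup every node is downstream of the perturbed node, so all raw magnitudes are inflated (node 3 even exceeds node 0), while the transformed scores concentrate on node 1 and its cycle-parent node 0.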

Paper Structure

This paper contains 31 sections, 8 theorems, 66 equations, 12 figures, 1 table, 1 algorithm.

Key Result

Lemma 2.1

The total causal effect of $j$ on $i$ multiplies direct causal effects along paths from $j$ to $i$: where $\mathcal{C}_\pi = \{(C_1, \dots, C_q) : \text{collection of disjoint cycles in } \tilde{G} \text{ not using any node in } \pi,\ q \in \mathbb{N}_0\}$.
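
The lemma's displayed formula is not shown here, and the sketch below does not attempt to reconstruct it. It only illustrates a standard fact that such path formulas build on: for a linear SEM $x = Ax + n$ with spectral radius of $A$ below one, the total-effect matrix $(I - A)^{-1}$ equals the series $\sum_k A^k$, whose $(i, j)$ entry sums products of direct effects over all walks from $j$ to $i$. The 3-node cycle and its weights are hypothetical.

```python
import numpy as np

# Hypothetical single cycle 0 -> 1 -> 2 -> 0; A[i, j] is the direct effect of j on i.
A = np.zeros((3, 3))
A[1, 0] = 0.5  # edge 0 -> 1
A[2, 1] = 0.4  # edge 1 -> 2
A[0, 2] = 0.3  # edge 2 -> 0

# Total causal effects in the linear SEM x = A x + n.
total = np.linalg.inv(np.eye(3) - A)

# Same matrix as a truncated series: A^k collects all walks of length k.
series = np.zeros_like(A)
power = np.eye(3)
for _ in range(200):
    series += power
    power = power @ A

# Effect of node 0 on node 2: the path 0 -> 1 -> 2 contributes 0.5 * 0.4, and
# each extra trip around the cycle multiplies by the cycle product 0.5 * 0.4 * 0.3.
path_product = A[1, 0] * A[2, 1]
cycle_product = A[1, 0] * A[2, 1] * A[0, 2]
closed_form = path_product / (1.0 - cycle_product)

print(total[2, 0], series[2, 0], closed_form)
```

All three printed numbers agree, which shows how a cycle rescales the plain product of direct effects along a path.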

Figures (12)

  • Figure 1: Projecting a linear SEM to an SEM for only the observed nodes.
  • Figure 2: The boxplots illustrate the rank of the true root cause score among the scores of all nodes. Cyclic (Graphical Lasso) achieves the strongest performance, whereas Cyclic (Inverse Covariance) performs comparably to Cholesky. As expected, Z-Score yields the weakest results. Results for Cholesky with $p=100$ are omitted due to computational infeasibility.
  • Figure 3: The rank of the true root cause's score is shown, with the dotted line indicating the total number of services. The dots connected by lines belong to the same incident. Notably, all methods show similar performance, which is often poor at the beginning and occasionally at the end of each incident, while it improves in the middle. This could stem from a delay between the triggering of the incident and its impact on the system, and from a potential resolution toward the end. Indeed, at the beginning and the end, the true root cause is often not even anomalous; see the appendix for details. In contrast, at the middle time points, the root causes often have very high outlier scores, which is probably why the Z-Score baseline performs quite well.
  • Figure 4: Effect of the threshold parameter $\tau$ on RCA performance for different intervention strengths when using Cyclic (Graphical Lasso) on $50$ nodes with the same data simulation setup as in the main paper. The weaker the interventions, the lower the preferable values of $\tau$.
  • Figure 5: Left: Performance of Cyclic (Graphical Lasso) across different regularization parameters $\alpha$ in the Graphical Lasso. The data are generated as described in the main paper, but here all graph types and intervention strengths are combined into a single plot. Note that the Graphical Lasso may fail for small $\alpha$, since the regularization is then insufficient to stabilize the inversion of ill-conditioned matrices. Only successful replications are included here. 'CV' denotes selection of $\alpha$ via cross-validation. Right: Corresponding success rate of the Graphical Lasso.
  • ...and 7 more figures
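
Several captions above refer to a Graphical Lasso estimate of the precision matrix with regularization parameter $\alpha$. As a rough sketch of what such an estimate could look like, the snippet below uses scikit-learn's `GraphicalLasso`. The choice of scikit-learn, the toy graph, and the value `alpha=0.01` are assumptions of this example and are not taken from the paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)

# Hypothetical toy graph: cycle 0 -> 1 -> 2 -> 0 plus edge 2 -> 3.
# A[i, j] is the direct effect of node j on node i.
A = np.zeros((4, 4))
A[1, 0] = A[2, 1] = A[0, 2] = 0.5
A[3, 2] = 0.8
mix = np.linalg.inv(np.eye(4) - A)

# Normal-mode samples with independent unit-variance noise (assumption).
X = rng.standard_normal((5_000, 4)) @ mix.T

# L1-regularized precision estimate; alpha is the regularization strength.
model = GraphicalLasso(alpha=0.01).fit(X)
theta_glasso = model.precision_

# Anomalous observation with a strong perturbation at node 1.
n_anom = np.array([0.3, -0.2, 0.5, -0.4])
n_anom[1] += 20.0
x_anom = mix @ n_anom

scores_glasso = np.abs(theta_glasso @ x_anom)
print("highest-scoring node:", int(np.argmax(scores_glasso)))
```

scikit-learn also provides `GraphicalLassoCV`, which selects `alpha` by cross-validation; whether that matches the 'CV' variant in the caption is not stated in the text above.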

Theorems & Definitions (12)

  • Lemma 2.1: Path matrix
  • Example 1
  • Theorem 3.1: Shortlist of root causes
  • Theorem 3.2: Even shorter list of root causes
  • Lemma 3.3: Remove nodes
  • Proof
  • Example 2
  • Lemma 3.4: Sparse noise precision matrix
  • Theorem 3.5: Shortlist of root causes
  • Example 3: Find root cause propagation route
  • ...and 2 more