Counterfactual-based Root Cause Analysis for Dynamical Systems

Juliane Weilbach, Sebastian Gerwinn, Karim Barsim, Martin Fränzle

TL;DR

This work tackles root-cause analysis for failures in dynamical systems through counterfactual reasoning on dynamic structural causal models (SCMs). It learns a nonlinear transition model via a residual neural network to capture time-evolving trajectories and derives counterfactual distributions under interventions on both structural equations and external influences. A tractable Shapley-value-based scoring scheme ranks interventions across time, enabling identification of key sub-systems driving observed faults. The method demonstrates improved root-cause identification on linear and nonlinear synthetic benchmarks and on a real river-flow dataset, highlighting practical utility for complex, time-dependent processes while noting limitations such as assuming a known graph and a single root cause.
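To make the ranking step concrete, the following is a minimal sketch of exact Shapley values over a handful of candidate interventions. It is not the paper's scoring scheme: the summary refers to a tractable approximation, whereas this sketch enumerates every ordering and is only feasible for a few candidates. The candidate list, the `EFFECT` numbers, and the `value` function are all hypothetical and chosen purely for illustration.

```python
from itertools import permutations

# Hypothetical candidates: (node, time step) pairs that could be intervened on.
CANDIDATES = [("x1", 24), ("x2", 24), ("x1", 23)]

# Assumed toy effect sizes: the fraction of the failure that each single
# intervention removes on its own. These numbers are made up.
EFFECT = {("x1", 24): 0.8, ("x2", 24): 0.3, ("x1", 23): 0.0}


def value(coalition):
    """Fraction of the failure removed when intervening on every member (capped at 1)."""
    return min(1.0, sum(EFFECT[c] for c in coalition))


def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = []
        for p in order:
            before = value_fn(coalition)
            coalition.append(p)
            totals[p] += value_fn(coalition) - before
    return {p: total / len(orderings) for p, total in totals.items()}


scores = shapley_values(CANDIDATES, value)
# Rank candidate interventions from most to least responsible.
ranking = sorted(CANDIDATES, key=lambda p: scores[p], reverse=True)
```

Because the scores sum to the value of intervening on all candidates at once, they can be read as a split of the total failure reduction among the (node, time) pairs. The number of orderings grows factorially with the number of candidates, which is why an approximation is needed for longer trajectories or more variables.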

Abstract

Identifying the underlying reason for a failing dynamic process or an otherwise anomalous observation is a fundamental challenge with numerous industrial applications. To identify the failure-causing sub-system using causal inference, one can ask the question: "Would the observed failure also occur if we had replaced the behaviour of a sub-system at a certain point in time with its normal behaviour?" To this end, a formal description of the behaviour of the full system is needed in which such counterfactual questions can be answered. However, existing causal methods for root cause identification are typically limited to static settings and focus on additive external influences causing failures rather than structural influences. In this paper, we address these problems by modelling the dynamic causal system using a Residual Neural Network and deriving corresponding counterfactual distributions over trajectories. We show quantitatively that more root causes are identified when an intervention is performed on both the structural equation and the external influence than when it is performed on the external influence only. By employing an efficient approximation to a corresponding Shapley value, we also obtain a ranking of the different subsystems at different points in time by their responsibility for an observed failure, which is applicable in settings with a large number of variables. We illustrate the effectiveness of the proposed method on a benchmark dynamic system as well as on a real-world river dataset.
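The counterfactual question in the abstract can be illustrated with a minimal sketch on a toy system. Everything here is an assumption made for illustration: the dynamics are a known two-node linear map rather than a learned Residual Neural Network, the noise is recovered deterministically rather than as a distribution, and the names `A`, `THRESHOLD`, `abduct`, and `counterfactual` are hypothetical. The intervention only resets the abducted external influence of one node at one time step to its assumed mean of zero while keeping the normal structural equation.

```python
# Assumed toy dynamics: x[t+1] = A @ x[t] + noise[t], two nodes, zero-mean noise.
A = [[0.9, 0.0],
     [0.5, 0.5]]
THRESHOLD = 3.0  # hypothetical failure threshold on node 1 at the final step


def step(x, noise):
    """One transition of the assumed linear structural equations."""
    return [sum(A[i][k] * x[k] for k in range(len(x))) + noise[i]
            for i in range(len(x))]


def rollout(x0, noises):
    """Simulate a trajectory from x0 under a given noise sequence."""
    traj = [list(x0)]
    for noise in noises:
        traj.append(step(traj[-1], noise))
    return traj


def abduct(traj):
    """Abduction: recover the noise that explains each observed transition."""
    zero = [0.0] * len(traj[0])
    return [[obs - pred for obs, pred in zip(traj[t + 1], step(traj[t], zero))]
            for t in range(len(traj) - 1)]


def failed(traj):
    """Failure criterion (assumed): node 1 ends outside the threshold region."""
    return abs(traj[-1][1]) > THRESHOLD


def counterfactual(traj, node, t):
    """Replace the abducted noise of `node` at time `t` by its normal value, then re-simulate."""
    noises = abduct(traj)
    noises[t][node] = 0.0  # assumed normal behaviour: zero-mean noise
    return rollout(traj[0], noises)


# Observed trajectory with a fault injected at node 0, time step 1.
steps = 5
true_noise = [[0.0, 0.0] for _ in range(steps)]
true_noise[1][0] = 10.0
observed = rollout([1.0, 1.0], true_noise)

# A (node, time) pair is a root-cause candidate if intervening there removes the failure.
root_causes = [(node, t) for t in range(steps) for node in range(2)
               if not failed(counterfactual(observed, node, t))]
```

In this toy run only the intervention at the injected location removes the failure, so `root_causes` contains the single pair `(0, 1)`. With several candidates passing the check, a ranking such as a Shapley-based score would be needed to order them.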

Paper Structure (20 sections, 14 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: This figure shows an overview of the individual steps of our method.
  • Figure 2: The figure shows the counterfactual samples for the FHN system with injected root cause at ($j=x_1,t=24$). The injected root cause disrupts the system observation heavily (black dashed line). However, the counterfactual intervention performed by our model NLin($S^j_t,N_t^j$) corrects the failure in both dimensions, such that it lies inside the threshold region (orange area).
  • Figure 3: The root cause was injected at a random node $j=x_1$ at $t=6$ with varying constants in $[1,10]$. The horizontal axis shows the injected constant relative to the noise standard deviation $\sigma$. We report the percentage of root causes that could be identified.
  • Figure 4: With the geographical knowledge of the river flow, a summary graph can be inferred (Figure taken from pmlr-v130-budhathoki21a).
  • Figure 5: We show five counterfactual samples (for each station) of our model NLin($S^j_t,N_t^j$) with the intervention at the predicted root cause at 08:30 on 16.03.2019. Additionally, we illustrate the resulting Shapley values for each time point, showing that right before the failure occurs the Shapley values increase.

Theorems & Definitions (5)

  • 5 definitions