Industrial-Grade Time-Dependent Counterfactual Root Cause Analysis through the Unanticipated Point of Incipient Failure: a Proof of Concept

Alexandre Trilla, Rajesh Rajendran, Ossee Yiboe, Quentin Possamaï, Nenad Mijatovic, Jordi Vitrià

TL;DR

The paper tackles root cause analysis (RCA) in industrial multivariate time series with a counterfactual, causality-based approach centered on the Point of Incipient Failure ($T$). It describes an end-to-end pipeline (data transformation, time-aware structure learning via Dynamic Causal Bayesian Networks, and counterfactuals computed by Abduction-Action-Prediction) to locate root causes and provide algorithmic recourse. In a synthetic 4-variable setup, the authors show that PC-based discovery on time-lag-augmented data can recover the causal structure (with RMSE $=0.1247$ for alarm prediction) and yield plausible, likelihood-ranked paths to the failure, along with counterfactual distributions that illustrate how changing the root-cause variable at $T$ could have averted the anomaly. The work contributes to industrial predictive maintenance by offering a complete, ISO-aligned RCA framework for explaining and predicting root causes and for optimizing recourse in dynamic, real-world settings.
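The Abduction-Action-Prediction steps named above can be illustrated with a minimal sketch. It assumes a hypothetical linear additive-noise model with one lagged channel and one alarm score; the class `ToySCM`, the coefficients `a` and `b`, and the alarm threshold are illustrative choices, not values or code from the paper, which uses Dynamic Causal Bayesian Networks rather than this simplified model.

```python
from dataclasses import dataclass


@dataclass
class ToySCM:
    """Hypothetical linear additive-noise model (not the paper's model):
    X_t = a * X_{t-1} + U_x   (channel reading)
    Y_t = b * X_t     + U_y   (alarm score)
    """
    a: float
    b: float

    def abduct(self, x_prev: float, x: float, y: float) -> tuple[float, float]:
        # Abduction: recover the exogenous noise consistent with the observation.
        u_x = x - self.a * x_prev
        u_y = y - self.b * x
        return u_x, u_y

    def counterfactual_y(self, x_prev: float, x: float, y: float, x_cf: float) -> float:
        _, u_y = self.abduct(x_prev, x, y)
        # Action: force X_t to x_cf, cutting its dependence on X_{t-1} and U_x.
        # Prediction: propagate through the unchanged alarm mechanism,
        # reusing the abducted noise u_y.
        return self.b * x_cf + u_y


ALARM_THRESHOLD = 4.0  # assumed threshold, for illustration only

scm = ToySCM(a=0.5, b=2.0)
x_prev, x_obs, y_obs = 1.0, 3.0, 6.5      # observation at the incipient failure
y_cf = scm.counterfactual_y(x_prev, x_obs, y_obs, x_cf=1.0)

factual_alarm = y_obs > ALARM_THRESHOLD
counterfactual_alarm = y_cf > ALARM_THRESHOLD
print(factual_alarm, counterfactual_alarm)
```

In this toy setting the alarm fires in the observed world but not in the counterfactual one, which is the kind of evidence that would flag the intervened channel as a candidate root cause and suggest a recourse value for it.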

Abstract

This paper describes the development of a counterfactual Root Cause Analysis diagnosis approach for an industrial multivariate time series environment. It directs attention toward the Point of Incipient Failure, the moment in time when the anomalous behavior is first observed and where the root cause is assumed to be found before the issue propagates. The paper presents the elementary but essential concepts of the solution and illustrates them experimentally in a simulated setting. Finally, it discusses avenues for maturing the causal technology so that it can meet the robustness challenges of increasingly complex industrial environments.

Paper Structure (29 sections, 8 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Time-implicit summary graph for a synthetic system that could describe a 2-out-of-3 redundancy. Note the time-confounded associations among the $X$ channels.
  • Figure 2: Timeline of system condition evolution showing a failure on Channel 1.
  • Figure 3: Time-explicit ground truth graph.
  • Figure 4: Distributions of potential alarm outcomes for a range of counterfactuals on the root cause channel. The mean and standard deviation of the (discrete) failure random variable are shown for each distribution.