Causal-discovery-based root-cause analysis and its application in time-series prediction error diagnosis

Hiroshi Yokoyama, Ryusei Shingaki, Kaneharu Nishino, Shohei Shimizu, Thong Pham

TL;DR

CD-RCA attributes prediction-error outliers of black-box ML models to their root causes. It learns a surrogate causal model in additive noise form, $X_p = g_p((X_q)_{q\in pa(p)}) + N_p$, and generates counterfactual errors from which Shapley-based attributions are computed. Concretely, it learns a causal graph from data, fits the regression functions and noise distributions, and uses them to simulate synthetic data in which each covariate's noise is randomized, so that each variable's contribution can be quantified by its attribution $\phi(p)$. The method is evaluated on synthetic time series and a real river inflow case, where CD-RCA outperforms heuristic baselines (e.g., LIME, EIG, LC, GPA, z-score) and yields plausible, interpretable root causes consistent with domain knowledge. Limitations include dependence on causal sufficiency and potential degradation under graph misspecification; extensions such as LPCMCI for latent confounders are proposed to broaden applicability to causally insufficient settings.
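
The attribution step summarised above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the four-node linear graph, its edge weights, the standard-normal noise, the observed noise vector, and the mean-absolute-value outlier score are all hypothetical choices made for the example. In CD-RCA itself the graph, regression functions, and noise distributions would be learned from data, and the target would be the prediction error.

```python
import itertools
import math

import numpy as np

# Hypothetical linear additive-noise model X_p = sum_q w[q->p] * X_q + N_p,
# listed in topological order: X1 -> {X2, X3} -> X4 (target).
NODES = ["X1", "X2", "X3", "X4"]
WEIGHTS = {"X2": {"X1": 0.8}, "X3": {"X1": 0.5}, "X4": {"X2": 1.0, "X3": 0.7}}


def simulate_target(noise):
    """Propagate noise of shape (n_samples, 4) through the model; return X4."""
    values = {}
    for j, node in enumerate(NODES):
        parents = WEIGHTS.get(node, {})
        values[node] = noise[:, j] + sum(w * values[q] for q, w in parents.items())
    return values["X4"]


def coalition_value(coalition, observed_noise, background_noise):
    """v(S): mean |X4| when nodes in S keep their observed noise
    and the noise of all other nodes is randomized (assumed score)."""
    noise = background_noise.copy()
    for j, node in enumerate(NODES):
        if node in coalition:
            noise[:, j] = observed_noise[j]
    return float(np.mean(np.abs(simulate_target(noise))))


def shapley_attributions(observed_noise, background_noise):
    """Exact Shapley values phi(p) over the noise terms (enumerates all subsets)."""
    n = len(NODES)
    phi = {}
    for p in NODES:
        others = [q for q in NODES if q != p]
        total = 0.0
        for k in range(n):
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for subset in itertools.combinations(others, k):
                with_p = coalition_value(set(subset) | {p}, observed_noise, background_noise)
                without_p = coalition_value(set(subset), observed_noise, background_noise)
                total += weight * (with_p - without_p)
        phi[p] = total
    return phi


rng = np.random.default_rng(0)
background_noise = rng.standard_normal((2000, 4))  # samples used for randomization
observed_noise = np.array([6.0, 0.1, -0.2, 0.05])  # outlier injected at X1 (assumed)

phi = shapley_attributions(observed_noise, background_noise)
root_cause = max(phi, key=phi.get)
print(root_cause, phi)
```

Because the background noise samples are fixed, the value function is deterministic and the attributions sum to the difference between the score of the fully observed sample and the score with all noise randomized. Exact enumeration costs $2^{n}$ coalition evaluations per variable, so a sampling-based Shapley approximation would likely be needed for larger graphs.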

Abstract

Rapid recent advances in machine learning have greatly improved the accuracy of prediction models, but most models remain "black boxes", which makes diagnosing prediction errors difficult, especially for outliers. This lack of transparency hinders trust and reliability in industrial applications. Heuristic attribution methods, while helpful, often fail to capture true causal relationships, leading to inaccurate error attributions. Various root-cause analysis methods based on Shapley values have been developed, yet they typically require a predefined causal graph, which limits their applicability to prediction errors in machine learning models. To address these limitations, we introduce the Causal-Discovery-based Root-Cause Analysis (CD-RCA) method, which estimates the causal relationships between the prediction error and the explanatory variables without needing a predefined causal graph. By simulating synthetic error data, CD-RCA can identify each variable's contribution to outliers in prediction errors using Shapley values. Extensive experiments show that CD-RCA outperforms current heuristic attribution methods.

Paper Structure

This paper contains 20 sections, 18 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Root cause detection for a prediction error outlier in time-series data. (A) The true causal graph. The prediction target variable is $Y = X_4$. The root cause of the outlier sample is $X_1$. (B) Normalized attribution of each variable provided by each method. The probability densities in the violin plots were calculated using a kernel density method based on attribution values from $50$ trials. The error bars indicate the maximum and minimum values of the attributions. (C) True positive rate of each method in identifying $X_1$ as the root cause.
  • Figure 2: Influence of the outlier magnitude and of the average total effect on the target variable on the performance of CD-RCA. (A--C) Root-cause detection accuracy as a function of the amplitude of $Z$ and the graph edge weight $\beta$ in model $F_1$. The causal graph diagram in each panel indicates the location of the exact root cause ($Z$) and of the edge change ($\beta$) for each simulation setting. (D--F) Results for model $F_2$, presented in the same manner as (A--C).
  • Figure 3: Total effect on the target variable $X_4$ in the models for the training data. (A) Average total effect (ATE) in model $F_1$ from $X_1$ to $X_4$, from $X_2$ to $X_4$, and from $X_3$ to $X_4$, respectively. (B) ATE results for $F_2$, presented in the same manner as for $F_1$. These results were obtained from causal models $F_1$ and $F_2$ using the dowhy-gcm package.
  • Figure 4: Map of the area surrounding the Taisetsu dam. The numbers $0$ to $6$ correspond to the observation points $m$ of the covariates $q_{n}^{(m)}$ produced by the Sugawara tank model. The observation points are located as follows. Upstream on the Ishikari river: 0 and 2. Downstream on the Ishikari river: 3. Other: 1, 4, 5, and 6.