Evolutionary Causal Discovery with Relative Impact Stratification for Interpretable Data Analysis

Ou Deng, Shoji Nishimura, Atsushi Ogihara, Qun Jin

TL;DR

This paper addresses the challenge of causal analysis in healthcare data where traditional causal discovery and SEM can be limited by small samples and interpretability. It introduces Evolutionary Causal Discovery (ECD), which combines Genetic Programming Symbolic Regression (GPSR) with Relative Impact Stratification (RIS) to uncover interpretable mathematical relationships and quantify the relative impact of predictors. The key contributions are the RIS-based expression simplification, the expression-tree visualization for unknown causal links, and demonstration on synthetic data and real EHR BMI data, with results aligning with SEM and SHAP analyses. The findings show that ECD maintains high accuracy and stability under noise, provides counterfactual reasoning capabilities through RIS, and offers a practical, interpretable framework for complex healthcare datasets with potential for broader domain applications.

Abstract

This study proposes Evolutionary Causal Discovery (ECD), a causal discovery method that tailors response variables, predictor variables, and corresponding operators to research datasets. After using genetic programming to parse variable relationships, the method applies the Relative Impact Stratification (RIS) algorithm to assess the relative impact of predictor variables on the response variable, facilitating expression simplification and enhancing the interpretability of variable relationships. ECD uses an expression tree to visualize the RIS results, offering a differentiated depiction of unknown causal relationships compared to conventional causal discovery. The ECD method represents an evolution and augmentation of existing causal discovery methods, providing an interpretable approach for analyzing variable relationships in complex systems, particularly in healthcare settings with Electronic Health Record (EHR) data. Experiments on both synthetic and real-world EHR datasets demonstrate the efficacy of ECD in uncovering patterns and mechanisms among variables, maintaining high accuracy and stability across different noise levels. On the real-world EHR dataset, ECD reveals the intricate relationships between the response variable and other predictive variables, aligning with the results of structural equation modeling (SEM) and Shapley additive explanations (SHAP) analyses.

Paper Structure (28 sections, 4 equations, 7 figures, 2 tables, 1 algorithm)

Figures (7)

  • Figure 1: Synthetic dataset with predefined causal relationships among variables, i.e., a known ground truth. As a simple example for the methodological test, predictive variables $A$ and $B$ follow normal distributions $A \sim \mathcal{N}(1,2)$ and $B \sim \mathcal{N}(2,1)$, respectively. Derived variables are $C = A + B$ and $D = 2A + 3$, and the response variable is $Z = B + \frac{C}{D}$. Sample size $n = 500$, with noise levels of 0%, 2%, and 5%.
  • Figure 2: Experimental results of selected major causal discovery methods and ECD on the synthetic dataset shown in Fig. 1.
  • Figure 3: Evaluation of ECD training procedures on the experimental synthetic dataset. The figure illustrates the evolutionary trajectories of minimal fitness and genetic diversity, depicted as curves of distinct colors, across ten experimental runs. Each experiment is conducted with a set duration of 30 generations.
  • Figure 4: Expression tree of ECD analysis of the experimental EHR dataset. Box nodes represent predictive variables situated at the topmost leaf nodes of the expression tree, whereas ellipse nodes denote symbolic regression operators constituting the internal nodes of the tree.
  • Figure 5: Exploratory analysis conducted via RIS within the ECD method. As an example, a positive 5% perturbation was applied to the predictive variable 'SmokerStatus', using the 1st quartile of the predictive variables.
  • ...and 2 more figures
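
The synthetic dataset described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. Two details are assumptions that the caption does not settle: the second parameter of $\mathcal{N}(\cdot,\cdot)$ is read as a variance, and a noise level of p% is modeled as multiplicative Gaussian noise on $Z$ with relative standard deviation p/100. The function name `make_synthetic` and the fixed seed are hypothetical.

```python
import numpy as np


def make_synthetic(n=500, noise_level=0.0, seed=0):
    """Generate the Figure 1 variables A, B, C, D, Z (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Assumption: N(mu, sigma^2) notation, so the scale is sqrt(variance).
    a = rng.normal(loc=1.0, scale=np.sqrt(2.0), size=n)  # A ~ N(1, 2)
    b = rng.normal(loc=2.0, scale=1.0, size=n)           # B ~ N(2, 1)

    c = a + b          # C = A + B
    d = 2.0 * a + 3.0  # D = 2A + 3
    z = b + c / d      # Z = B + C / D

    # Assumption: noise is multiplicative and Gaussian, applied to Z only.
    z_noisy = z * (1.0 + noise_level * rng.standard_normal(n))

    return {"A": a, "B": b, "C": c, "D": d, "Z": z_noisy}


# One dataset per noise level named in the caption (0%, 2%, 5%).
datasets = {level: make_synthetic(noise_level=level) for level in (0.0, 0.02, 0.05)}
```

Because $D = 2A + 3$ can fall close to zero for some draws of $A$, the ratio $C/D$ occasionally produces very large values of $Z$. Any reproduction should expect a heavy-tailed response, whichever noise model is chosen.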