Learning Probabilities of Causation from Finite Population Data

Shuai Wang, Song Jiang, Yizhou Sun, Judea Pearl, Ang Li

TL;DR

This work tackles the problem of estimating probabilities of causation, specifically the probability of necessity and sufficiency (PNS), for subpopulations with insufficient data by leveraging machine learning. A two-stage pipeline is introduced: (i) generate informer data from multiple structural causal models (SCMs) to obtain accurate PNS bounds for subpopulations with ample data, and (ii) train diverse ML models to predict these bounds for sparser subgroups. Empirically, a multilayer perceptron (MLP) with the Mish activation consistently achieves the lowest mean absolute error (MAE), around 0.02 on most of the four SCMs, with higher errors for the more complex Mediator structures. The study also provides a synthetic, multi-SCM dataset and releases code to facilitate future research. It highlights practical implications for causal decision-making under limited data and outlines strategies to collect or aggregate data efficiently. Overall, the results demonstrate that learning-based approaches can extend causal quantities to subpopulations where direct estimation is impractical, bridging causal inference and machine learning for real-world counterfactual reasoning.
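As a minimal sketch of two ingredients named above, the Mish activation and the MAE metric can be written in plain Python. Mish is the standard $z \cdot \tanh(\mathrm{softplus}(z))$; the function names and the sample bound values below are illustrative assumptions and are not taken from the paper's released code.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def mish(z):
    # Mish activation: z * tanh(softplus(z)).
    return z * math.tanh(softplus(z))


def mean_absolute_error(y_true, y_pred):
    # Average absolute difference between true and predicted values.
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true vs. predicted PNS lower bounds for three subpopulations.
true_lower = [0.50, 0.10, 0.30]
pred_lower = [0.52, 0.08, 0.32]
mae = mean_absolute_error(true_lower, pred_lower)
```

With these made-up values every prediction is off by 0.02, so the MAE matches the order of magnitude reported in the summary; real predictions would come from a trained model rather than hand-written lists.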

Abstract

Probabilities of causation play a crucial role in modern decision-making. This paper addresses the challenge of predicting probabilities of causation for subpopulations with insufficient data using machine learning models. Tian and Pearl first defined and derived tight bounds for three fundamental probabilities of causation: the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). However, estimating these probabilities requires both experimental and observational distributions specific to each subpopulation, which are often unavailable or impractical to obtain with limited population-level data. Consequently, most subgroups lack enough data to estimate their probabilities accurately. To estimate these probabilities for subpopulations with insufficient data, we propose using machine learning models that draw insights from subpopulations with sufficient data. Our evaluation of multiple machine learning models indicates that, given the population-level data and an appropriate choice of machine learning model and activation function, PNS can be effectively predicted. Through simulation studies on multiple Structural Causal Models (SCMs), we show that our multilayer perceptron (MLP) model with the Mish activation function achieves a mean absolute error (MAE) of approximately $0.02$ in predicting PNS for $32{,}768$ subpopulations across most SCMs using data from only $2{,}000$ subpopulations with known PNS values.
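For reference, the tight PNS bounds of Tian and Pearl can be evaluated directly once a subpopulation's experimental and observational probabilities are known. The sketch below assumes binary $X$ and $Y$; the function name, argument names, and the numbers in the usage lines are hypothetical and serve only to illustrate the formula.

```python
def pns_bounds(p_y_do_x, p_y_do_xp, p_xy, p_xyp, p_xpy, p_xpyp):
    """Tian-Pearl bounds on PNS for binary X and Y.

    p_y_do_x  : P(y_x), probability of y under the intervention do(X = x)
    p_y_do_xp : P(y_{x'}), probability of y under do(X = x')
    p_xy, p_xyp, p_xpy, p_xpyp : observational joint probabilities
        P(x, y), P(x, y'), P(x', y), P(x', y')
    """
    if abs(p_xy + p_xyp + p_xpy + p_xpyp - 1.0) > 1e-9:
        raise ValueError("observational joint probabilities must sum to 1")

    p_y = p_xy + p_xpy  # marginal P(y)

    # Lower bound: max{0, P(y_x) - P(y_x'), P(y) - P(y_x'), P(y_x) - P(y)}
    lower = max(0.0, p_y_do_x - p_y_do_xp, p_y - p_y_do_xp, p_y_do_x - p_y)

    # Upper bound: min{P(y_x), P(y'_x'), P(x,y) + P(x',y'),
    #                  P(y_x) - P(y_x') + P(x,y') + P(x',y)}
    upper = min(
        p_y_do_x,
        1.0 - p_y_do_xp,
        p_xy + p_xpyp,
        p_y_do_x - p_y_do_xp + p_xyp + p_xpy,
    )
    return lower, upper


# Hypothetical subpopulation: P(y_x) = 0.7, P(y_x') = 0.2, and an
# observational joint distribution over (X, Y) that sums to 1.
lower, upper = pns_bounds(0.7, 0.2, 0.35, 0.15, 0.10, 0.40)
```

For these made-up inputs the interval is roughly $[0.5, 0.7]$. Applying the formula to finite samples requires estimating all six inputs for each subpopulation, which is why sparsely populated subgroups yield unreliable bounds.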

Paper Structure

This paper contains 43 sections, 1 theorem, 22 equations, 9 figures, 1 table.

Key Result

Theorem 5

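If $Y$ is monotonic relative to $X$, then PNS, PN, and PS are all identifiable, and

$$\text{PNS} = P(y_x) - P(y_{x'}), \quad \text{PN} = \frac{P(y) - P(y_{x'})}{P(x, y)}, \quad \text{PS} = \frac{P(y_x) - P(y)}{P(x', y')}.$$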

Figures (9)

  • Figure 1: Framework for Causal Data Generation and Machine Learning Prediction.
  • Figure 2: Different SCMs in this study.
  • Figure 3: Confusion matrices of MLP (Mish) for both lower and upper bounds on different datasets.
  • Figure 4: Comparison between true values and predictions of MLP (Mish) for both lower and upper bounds on different datasets.
  • Figure 5: Different SCMs in this study.
  • ...and 4 more figures

Theorems & Definitions (5)

  • Definition 1: Probability of necessity (PN)
  • Definition 2: Probability of sufficiency (PS)
  • Definition 3: Probability of necessity and sufficiency (PNS)
  • Definition 4
  • Theorem 5