Learning Probabilities of Causation from Finite Population Data
Shuai Wang, Song Jiang, Yizhou Sun, Judea Pearl, Ang Li
TL;DR
This work addresses the problem of estimating probabilities of causation, specifically the probability of necessity and sufficiency (PNS), for subpopulations with insufficient data by leveraging machine learning. It introduces a two-stage pipeline: (i) generate informer data from multiple structural causal models (SCMs) to obtain accurate PNS bounds for subpopulations with ample data, and (ii) train diverse ML models to predict these bounds for subpopulations whose data are too sparse for direct estimation. Empirically, a multilayer perceptron (MLP) with the Mish activation consistently performs best, achieving a mean absolute error of around 0.02 across four SCMs, with higher errors on the more complex Mediator structures. The study also provides a synthetic, multi-SCM dataset and releases code to facilitate future research, highlights practical implications for causal decision-making under limited data, and outlines strategies to collect or aggregate data efficiently. Overall, the results demonstrate that learning-based approaches can extend causal quantities to subpopulations where direct estimation is impractical, bridging causal inference and machine learning for real-world counterfactual reasoning.
Abstract
Probabilities of causation play a crucial role in modern decision-making. This paper addresses the challenge of predicting probabilities of causation for subpopulations with \textbf{insufficient} data using machine learning models. Tian and Pearl first defined and derived tight bounds for three fundamental probabilities of causation: the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). However, estimating these probabilities requires both experimental and observational distributions specific to each subpopulation, which are often unavailable or impractical to obtain with limited population-level data. As a result, most subpopulations do not have enough data to guarantee accurate estimates of their probabilities of causation. To estimate these probabilities for subpopulations with \textbf{insufficient} data, we therefore propose using machine learning models that draw insights from subpopulations with sufficient data. Our evaluation of multiple machine learning models indicates that, given the population-level data and an appropriate choice of machine learning model and activation function, PNS can be effectively predicted. Through simulation studies on multiple Structural Causal Models (SCMs), we show that our multilayer perceptron (MLP) model with the Mish activation function achieves a mean absolute error (MAE) of approximately $0.02$ in predicting PNS for $32{,}768$ subpopulations across most SCMs using data from only $2{,}000$ subpopulations with known PNS values.
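
For illustration, the Tian and Pearl bounds on PNS referred to above can be computed directly once a subpopulation's experimental and observational distributions are known. The following is a minimal sketch for a binary treatment and a binary outcome, not the released code of this paper; the function name `pns_bounds`, its argument names, and the numeric inputs are hypothetical and chosen only for the example.

```python
def pns_bounds(p_y_do_x, p_y_do_xp, p_xy, p_xyp, p_xpy, p_xpyp):
    """Tian-Pearl bounds on PNS for binary treatment X and binary outcome Y.

    Experimental inputs (assumed to come from a randomized study):
        p_y_do_x  = P(y | do(x)),  p_y_do_xp = P(y | do(x'))
    Observational inputs (joint distribution of X and Y):
        p_xy = P(x, y),  p_xyp = P(x, y'),  p_xpy = P(x', y),  p_xpyp = P(x', y')
    """
    p_y = p_xy + p_xpy  # marginal P(y) from the observational joint
    lower = max(
        0.0,
        p_y_do_x - p_y_do_xp,
        p_y - p_y_do_xp,
        p_y_do_x - p_y,
    )
    upper = min(
        p_y_do_x,
        1.0 - p_y_do_xp,  # P(y' | do(x'))
        p_xy + p_xpyp,
        p_y_do_x - p_y_do_xp + p_xyp + p_xpy,
    )
    return lower, upper


# Hypothetical subpopulation with ample data (numbers are made up).
lower, upper = pns_bounds(
    p_y_do_x=0.7, p_y_do_xp=0.2,
    p_xy=0.4, p_xyp=0.1, p_xpy=0.1, p_xpyp=0.4,
)
print(f"PNS is bounded by [{lower:.3f}, {upper:.3f}]")
```

With sparse data, the six input probabilities cannot be estimated reliably for a given subpopulation, which is the setting in which a learned model is used to predict the bounds instead.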
