Causal Explanations for Disparate Trends: Where and Why?
Tal Blau, Brit Youngmann, Anna Fariha, Yuval Moskovitch
TL;DR
ExDis formalizes and solves the problem of automatically discovering causal explanations for disparities between two groups by identifying subpopulations where disparities are pronounced and uncovering causal factors that differentially affect the groups within those subpopulations. The framework combines Apriori-based subpopulation mining, a CauSumX-inspired treatment-mining step, and a diversity-aware greedy search, all while enforcing coverage and non-redundancy. Empirical results on Stack Overflow, ACS, and MEPS show ExDis produces higher-quality, locally informative explanations and scales to large, high-dimensional data, outperforming global or fixed-treatment baselines. The work demonstrates practical value for bias debugging and policy-making, while acknowledging limitations and outlining directions for extension to multi-relational data and multi-group disparities.
Abstract
During data analysis, we are often perplexed by certain disparities observed between two groups of interest within a dataset. To better understand an observed disparity, we need explanations that can pinpoint the data regions where the disparity is most pronounced, along with its causes, i.e., factors that alleviate or exacerbate the disparity. This task is complex and tedious, particularly for large and high-dimensional datasets, and calls for an automatic system for discovering explanations (data regions and causes) of an observed disparity. It is critical that explanations for disparities are not only interpretable but also actionable, enabling users to make informed, data-driven decisions. This requires explanations to go beyond surface-level correlations and instead capture causal relationships. We introduce ExDis, a framework for discovering causal Explanations for Disparities between two groups of interest. ExDis identifies data regions (subpopulations) where disparities are most pronounced (or reversed), and associates specific factors that causally contribute to the disparity within each identified data region. We formally define the ExDis framework and the associated optimization problem, analyze its complexity, and develop an efficient algorithm to solve the problem. Through extensive experiments over three real-world datasets, we demonstrate that ExDis generates meaningful causal explanations, outperforms prior methods, and scales effectively to handle large, high-dimensional datasets.
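To make the notion of a "data region where the disparity is most pronounced (or reversed)" concrete, the following minimal Python sketch ranks subpopulations by the gap in mean outcome between two groups. It is an illustration only and not the ExDis algorithm: it enumerates conjunctive patterns by brute force rather than with Apriori, and the gap it reports is a plain difference of means, not a causal estimate. The function name `rank_subpopulations`, its parameters, and the toy records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def rank_subpopulations(rows, group_key, group_a, group_b, outcome_key,
                        attrs, min_support=2, max_len=2):
    """Rank conjunctive patterns by |mean(outcome | A) - mean(outcome | B)|.

    Hypothetical helper for illustration; brute-force enumeration,
    purely correlational (no causal adjustment).
    """
    results = []
    for k in range(1, max_len + 1):
        for combo in combinations(attrs, k):
            # pattern -> outcomes observed for each of the two groups
            buckets = defaultdict(lambda: {group_a: [], group_b: []})
            for row in rows:
                if row[group_key] not in (group_a, group_b):
                    continue
                pattern = tuple((a, row[a]) for a in combo)
                buckets[pattern][row[group_key]].append(row[outcome_key])
            for pattern, by_group in buckets.items():
                a_vals, b_vals = by_group[group_a], by_group[group_b]
                # Skip patterns where either group is too small to compare.
                if len(a_vals) < min_support or len(b_vals) < min_support:
                    continue
                gap = sum(a_vals) / len(a_vals) - sum(b_vals) / len(b_vals)
                results.append((pattern, gap, len(a_vals) + len(b_vals)))
    # Largest absolute gap first; the sign shows whether the gap is reversed.
    results.sort(key=lambda r: abs(r[1]), reverse=True)
    return results


# Toy, made-up records (not taken from any dataset used in the paper).
rows = [
    {"gender": "M", "role": "backend", "country": "US", "salary": 100},
    {"gender": "M", "role": "backend", "country": "DE", "salary": 110},
    {"gender": "F", "role": "backend", "country": "US", "salary": 80},
    {"gender": "F", "role": "backend", "country": "DE", "salary": 70},
    {"gender": "M", "role": "data", "country": "US", "salary": 90},
    {"gender": "M", "role": "data", "country": "DE", "salary": 90},
    {"gender": "F", "role": "data", "country": "US", "salary": 95},
    {"gender": "F", "role": "data", "country": "DE", "salary": 85},
]

ranked = rank_subpopulations(rows, "gender", "M", "F", "salary",
                             attrs=["role", "country"])
for pattern, gap, support in ranked:
    print(pattern, gap, support)
```

On this toy input the pattern `role = backend` ranks first, while `role = data` shows no gap at all, which is the kind of local contrast a single global comparison would hide. Identifying which factors causally drive the gap inside each region is the part that requires the treatment-mining machinery the paper describes, and is deliberately left out of this sketch.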
