Causal Explanations for Disparate Trends: Where and Why?

Tal Blau, Brit Youngmann, Anna Fariha, Yuval Moskovitch

TL;DR

ExDis formalizes and solves the problem of automatically discovering causal explanations for disparities between two groups by identifying subpopulations where disparities are pronounced and uncovering causal factors that differentially affect the groups within those subpopulations. The framework combines Apriori-based subpopulation mining, a CauSumX-inspired treatment-mining step, and a diversity-aware greedy search, all while enforcing coverage and non-redundancy. Empirical results on Stack Overflow, ACS, and MEPS show ExDis produces higher-quality, locally informative explanations and scales to large, high-dimensional data, outperforming global or fixed-treatment baselines. The work demonstrates practical value for bias debugging and policy-making, while acknowledging limitations and outlining directions for extension to multi-relational data and multi-group disparities.

Abstract

During data analysis, we are often perplexed by certain disparities observed between two groups of interest within a dataset. To better understand an observed disparity, we need explanations that can pinpoint the data regions where the disparity is most pronounced, along with its causes, i.e., factors that alleviate or exacerbate the disparity. This task is complex and tedious, particularly for large and high-dimensional datasets, demanding an automatic system for discovering explanations (data regions and causes) of an observed disparity. It is critical that explanations for disparities are not only interpretable but also actionable, enabling users to make informed, data-driven decisions. This requires explanations to go beyond surface-level correlations and instead capture causal relationships. We introduce ExDis, a framework for discovering causal Explanations for Disparities between two groups of interest. ExDis identifies data regions (subpopulations) where disparities are most pronounced (or reversed), and associates specific factors that causally contribute to the disparity within each identified data region. We formally define the ExDis framework and the associated optimization problem, analyze its complexity, and develop an efficient algorithm to solve the problem. Through extensive experiments over three real-world datasets, we demonstrate that ExDis generates meaningful causal explanations, outperforms prior methods, and scales effectively to handle large, high-dimensional datasets.
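To make the "where" part of the problem concrete, the following is a minimal sketch of level-wise (Apriori-style) enumeration of subpopulations, followed by a per-subpopulation gap between two groups. It is not the ExDis implementation: the function names, the toy rows, and the attribute names (`role`, `country`, `group`, `y`) are all hypothetical, and the gap is a plain difference of group means rather than the causal disparity score the paper uses.

```python
from itertools import combinations
from statistics import mean


def frequent_patterns(rows, attrs, min_support, max_len=2):
    """Return {pattern: covered row indices} for attribute=value conjunctions
    whose support (fraction of rows covered) is at least min_support."""
    n = len(rows)

    def cover(pattern):
        return [i for i, r in enumerate(rows)
                if all(r[a] == v for a, v in pattern)]

    items = sorted({(a, r[a]) for r in rows for a in attrs})
    level = [(item,) for item in items]
    frequent = {}
    for size in range(1, max_len + 1):
        survivors = set()
        for pattern in level:
            idx = cover(pattern)
            if len(idx) / n >= min_support:
                frequent[pattern] = idx
                survivors.update(pattern)
        # Apriori pruning: a larger candidate is kept only if every
        # sub-pattern one predicate shorter is itself frequent.
        level = [
            cand for cand in combinations(sorted(survivors), size + 1)
            if len({a for a, _ in cand}) == size + 1  # one value per attribute
            and all(sub in frequent for sub in combinations(cand, size))
        ]
    return frequent


def group_gap(rows, idx, group_attr, g1, g2, outcome):
    """Difference of mean outcome between g1 and g2 inside one subpopulation
    (a correlational stand-in, NOT a causal estimate)."""
    a = [rows[i][outcome] for i in idx if rows[i][group_attr] == g1]
    b = [rows[i][outcome] for i in idx if rows[i][group_attr] == g2]
    if not a or not b:
        return None
    return mean(a) - mean(b)


# Made-up toy data, for illustration only.
rows = [
    {"role": "dev", "country": "US", "group": "A", "y": 100},
    {"role": "dev", "country": "US", "group": "B", "y": 80},
    {"role": "dev", "country": "DE", "group": "A", "y": 70},
    {"role": "dev", "country": "DE", "group": "B", "y": 70},
    {"role": "qa", "country": "US", "group": "A", "y": 60},
    {"role": "qa", "country": "US", "group": "B", "y": 65},
]
patterns = frequent_patterns(rows, ["role", "country"], min_support=0.3)
gaps = {p: group_gap(rows, idx, "group", "A", "B", "y")
        for p, idx in patterns.items()}
```

On this toy data the gap is largest among US developers and reverses sign among QA staff, which is the kind of local or reversed trend the abstract refers to. Identifying which factors causally drive each gap requires the causal machinery described in the paper and is outside the scope of this sketch.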

Paper Structure

This paper contains 35 sections, 1 theorem, 6 equations, 5 figures, 17 tables, 1 algorithm.

Key Result

Proposition 4.1

Given a set of candidate disparity explanations $\Phi_c$, a budget $k$, a support threshold $\sigma$, a similarity threshold $\tau$, and a bound $B$, determining whether $\exists\Phi \subseteq \Phi_c$ s.t. $|\Phi| \leq k$, $\forall \phi_i\in \Phi,~ support(\phi_i) \geq \sigma$, $\forall \phi_i,\phi_j\i
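The statement is cut off in this copy, so the exact similarity constraint and the role of $B$ are not visible here. As a rough illustration of the visible constraints only (budget $k$, support at least $\sigma$, pairwise similarity bounded by $\tau$), here is a simple greedy heuristic. It is a sketch under stated assumptions, not the authors' algorithm: Jaccard overlap of covered rows as the similarity measure, "similarity at most $\tau$" as the constraint, ranking by absolute score, and all names and candidate values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections of row ids (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_select(candidates, k, sigma, tau, n_rows):
    """Pick at most k candidates with support >= sigma such that every
    chosen pair has Jaccard similarity <= tau, preferring large |score|."""
    eligible = [c for c in candidates if len(c["rows"]) / n_rows >= sigma]
    eligible.sort(key=lambda c: abs(c["score"]), reverse=True)
    chosen = []
    for cand in eligible:
        if len(chosen) == k:
            break
        if all(jaccard(cand["rows"], s["rows"]) <= tau for s in chosen):
            chosen.append(cand)
    return chosen


# Made-up candidates over a 6-row dataset, for illustration only.
candidates = [
    {"name": "p1", "rows": {0, 1, 2, 3}, "score": 10},
    {"name": "p2", "rows": {0, 1, 2}, "score": 9},   # overlaps heavily with p1
    {"name": "p3", "rows": {4, 5}, "score": -5},     # disjoint, reversed trend
    {"name": "p4", "rows": {3}, "score": 20},        # too little support
]
selected = [c["name"] for c in
            greedy_select(candidates, k=2, sigma=0.3, tau=0.5, n_rows=6)]
```

Here `p4` is dropped for low support and `p2` for redundancy with `p1`, leaving `p1` and `p3`. A greedy pass like this offers no optimality guarantee, and the selection procedure in the paper may differ in both its objective and its similarity measure.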

Figures (5)

  • Figure 1: Partial causal DAG for the Stack Overflow dataset.
  • Figure 2: Effect of various system parameters on the disparity score. (a) & (b) Absolute disparity scores are reported to show the direct impact of the similarity threshold $\tau$ and the number of clusters on the disparity score $\Delta$. (c) Effect of causal DAG modification. Disparity scores here are shown relative to the disparity score of CDI [youngmann2023causal] (which ExDis uses).
  • Figure 3: Effects of various parameters on runtime: (left) the budget parameter $k$, (center) number of attributes, and (right) fraction of data.
  • Figure 4: Effect of various settings of using optimization techniques on runtime across three datasets. Note that the y-axis is in log scale.
  • Figure 5: Reduction Example

Theorems & Definitions (12)

  • Example 1.1: Investigating a disparate trend
  • Example 1.2: Debugging bias
  • Example 1.3: Discovering reverse trends
  • Example 3.1
  • Example 3.2
  • Definition 4.1: Pattern
  • Example 4.1
  • Example 4.2
  • Definition 4.2: disparity explanation
  • Example 4.3
  • ...and 2 more