Revisiting Counterfactual Regression through the Lens of Gromov-Wasserstein Information Bottleneck

Hao Yang, Zexu Sun, Hongteng Xu, Xu Chen

TL;DR

Problem: selection bias in CFR-based ITE estimation. Approach: GWIB reframes the CFR encoder as an information bottleneck and bounds the kernelized mutual information with a GW-based regularizer, which combines a GW term and a fused GW (FGW) term to enforce consistent cross-group correspondence while avoiding trivial latent encodings. Contributions: (i) a theoretical bound linking $\hat{I}_{\kappa,N}(Z,X;\phi)$ to $MG_{GW_2}(\hat{\rho}_N,\phi)$, together with its GW interpretation; (ii) a practical bi-level optimization that alternates CG updates for the transport plan with SGD updates for the model parameters; (iii) extensive experiments on IHDP and ACIC showing consistent gains over state-of-the-art CFR methods; (iv) public release of code. Significance: provides a principled OT-based mechanism that mitigates both selection bias and over-enforced balance in CFR while preserving the information needed for ITE estimation.

Abstract

As a promising individualized treatment effect (ITE) estimation method, counterfactual regression (CFR) maps individuals' covariates to a latent space and predicts their counterfactual outcomes. However, the selection bias between control and treatment groups often imbalances the two groups' latent distributions and negatively impacts this method's performance. In this study, we revisit counterfactual regression through the lens of information bottleneck and propose a novel learning paradigm called Gromov-Wasserstein information bottleneck (GWIB). In this paradigm, we learn CFR by maximizing the mutual information between covariates' latent representations and outcomes while penalizing the kernelized mutual information between the latent representations and the covariates. We demonstrate that the upper bound of the penalty term can be implemented as a new regularizer consisting of i) the fused Gromov-Wasserstein distance between the latent representations of different groups and ii) the gap between the transport cost generated by the model and the cross-group Gromov-Wasserstein distance between the latent representations and the covariates. GWIB effectively learns the CFR model through alternating optimization, suppressing selection bias while avoiding trivial latent distributions. Experiments on ITE estimation tasks show that GWIB consistently outperforms state-of-the-art CFR methods. To promote the research community, we release our project at https://github.com/peteryang1031/Causal-GWIB.
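As a rough illustration of the Gromov-Wasserstein quantity the abstract refers to, the sketch below evaluates the squared-loss GW objective between two groups' latent representations for a given coupling $\bm{T}$. It is not taken from the released code: the function names `pairwise_dist` and `gw_cost`, the Euclidean metric, and the toy data are all assumptions made for illustration, and neither the fused variant nor the search for the optimal coupling is shown.

```python
import numpy as np

def pairwise_dist(A):
    """Euclidean distance matrix between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gw_cost(D0, D1, T):
    """Squared-loss GW objective for a fixed coupling T:
    sum_{i,j,k,l} (D0[i,k] - D1[j,l])^2 * T[i,j] * T[k,l]."""
    p = T.sum(axis=1)  # marginal over group 0
    q = T.sum(axis=0)  # marginal over group 1
    term0 = p @ (D0 ** 2) @ p
    term1 = q @ (D1 ** 2) @ q
    cross = np.sum((D0 @ T @ D1.T) * T)
    return term0 + term1 - 2.0 * cross

# Toy latent codes (hypothetical): group 1 is a rotated copy of group 0.
rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z1 = z0 @ R.T

D0, D1 = pairwise_dist(z0), pairwise_dist(z1)
T_match = np.eye(5) / 5.0          # pairs each point with its rotated copy
T_indep = np.full((5, 5), 1 / 25)  # independent (product) coupling

cost_match = gw_cost(D0, D1, T_match)
cost_indep = gw_cost(D0, D1, T_indep)
print(cost_match, cost_indep)
```

Because GW compares within-group distances rather than coordinates, the matching coupling gives a cost of essentially zero even though the two groups are rotated relative to each other, whereas the independent coupling does not.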

Paper Structure

This paper contains 31 sections, 3 theorems, 30 equations, 6 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Assume that $(\mathcal{X},d_{X})$ and $(\mathcal{Z},d_Z)$ are two bounded spaces, whose diameters are denoted as $\text{Diam}_{\mathcal{X}}$ and $\text{Diam}_{\mathcal{Z}}$, respectively. Given samples $\bm{X}=\{x_{n}\}_{n=1}^{N}$ and corresponding $\bm{Z}=\{z_n\}_{n=1}^N$, for the empirical kernelized mutual information ... where $\bm{D}_{X}=[d_X(x_{m},x_{n})]\in\mathbb{R}^{N\times N}$ and $\bm{D}_{Z}=[d_Z(z_{m},z_{n})]\in\mathbb{R}^{N\times N}$ ...
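The distance matrices $\bm{D}_X$ and $\bm{D}_Z$ in the theorem can be built directly from samples. The sketch below does so under assumptions that are not in the source: Euclidean metrics for both $d_X$ and $d_Z$, toy data, and a hypothetical helper `map_distortion` that evaluates the squared-loss GW objective at the identity coupling $\bm{T}=\bm{I}/N$ induced by pairing each $x_n$ with its own $z_n$. This is one plausible reading of the "transport cost generated by the model"; the gap described in the abstract additionally subtracts a GW distance, which requires solving for an optimal coupling and is not shown here.

```python
import numpy as np

def dist_matrix(P):
    """Pairwise Euclidean distances between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def map_distortion(D_X, D_Z):
    """Mean squared difference between paired distances, i.e. the GW
    objective at the identity coupling T = I/N (x_n paired with z_n)."""
    return float(np.mean((D_X - D_Z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # toy covariates (hypothetical)
D_X = dist_matrix(X)
diam_X = float(D_X.max())     # empirical diameter of the sample

# An isometric "encoder": reorder coordinates and shift.
Z_iso = X[:, ::-1] + 1.0
# A trivial "encoder": every sample collapses to the same point.
Z_collapsed = np.zeros((6, 2))

d_iso = map_distortion(D_X, dist_matrix(Z_iso))
d_collapsed = map_distortion(D_X, dist_matrix(Z_collapsed))
print(diam_X, d_iso, d_collapsed)
```

An encoder that preserves all pairwise distances incurs zero distortion, while collapsing every sample to one point yields the mean of the squared entries of $\bm{D}_X$.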

Figures (6)

  • Figure 1: The scheme of GWIB. $\hat{\rho}_0$ and $\hat{\rho}_1$ denote the empirical covariate distributions of control and treatment groups, respectively. $\phi$ is the encoder mapping covariates to their latent representations. An upper bound of the empirical KMI is approximated based on the Gromovized Monge gap associated with $\phi$. Further considering the classic balancing penalty leads to the proposed regularizer, which involves the GW and FGW distances that share the same optimal transport plan $\bm{T}^*$.
  • Figure 2: (Upper) Illustrations of the GW distances between the covariate distributions and the latent counterparts on ACIC and IHDP datasets. (Lower) The t-SNE plots of the original covariates of ACIC and the latent representations achieved by CFR-Wass and GWIB.
  • Figure 3: Visualizations of the training and validation procedures of GWIB on the IHDP dataset.
  • Figure 4: $\epsilon_{PEHE}$ and $\epsilon_{ATE}$ for different values of $\lambda$ in both in-sample and out-of-sample experiments on the ACIC dataset.
  • Figure 5: $\epsilon_{PEHE}$ and $\epsilon_{ATE}$ for different values of $\lambda$ in both in-sample and out-of-sample experiments on the IHDP dataset.
  • ...and 1 more figure
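The captions above report $\epsilon_{PEHE}$ and $\epsilon_{ATE}$ without defining them. The sketch below uses the definitions that are common in the ITE literature (root mean squared error of per-individual effects, and absolute error of the average effect); whether the paper uses exactly these forms is an assumption, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def pehe(tau_true, tau_pred):
    """Root mean squared error between true and predicted individual effects."""
    return float(np.sqrt(np.mean((tau_pred - tau_true) ** 2)))

def ate_error(tau_true, tau_pred):
    """Absolute error between true and predicted average treatment effects."""
    return float(abs(np.mean(tau_pred) - np.mean(tau_true)))

# Toy effects: every individual is off by 1, but the errors cancel on average.
tau_true = np.array([1.0, 2.0, 3.0, 4.0])
tau_pred = np.array([2.0, 1.0, 4.0, 3.0])

eps_pehe = pehe(tau_true, tau_pred)      # 1.0
eps_ate = ate_error(tau_true, tau_pred)  # 0.0
print(eps_pehe, eps_ate)
```

The toy numbers show why both metrics are reported: the average effect can be recovered exactly even when every individual estimate is wrong.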

Theorems & Definitions (5)

  • Theorem 1
  • Definition 1: Gromovized Monge gap
  • Theorem 2
  • Lemma 1: Upper bound of the Jensen gap for the logarithmic function (costarelli2015sharp)
  • Definition 2: Gromov-Monge distance