Leveraging a Simulator for Learning Causal Representations from Post-Treatment Covariates for CATE

Lokesh Nagalapatti, Pranava Singhal, Avishek Ghosh, Sunita Sarawagi

TL;DR

This work tackles the identifiability challenge of estimating Conditional Average Treatment Effects when covariates are observed post-treatment. It proposes SimPONet, a joint real–simulator learning framework that regularizes learning using simulator-derived representations and simulated CATE, guided by a theoretical bound on CATE error under real–simulator mismatch. The method is validated across linear and nonlinear data-generating processes and semi-synthetic benchmarks, showing robust improvements over baselines and illuminating when simulator supervision benefits CATE estimation. The contributions include a formalization of the real–simulator gap, a practical learning objective that adapts to simulator relevance, and extensive experiments demonstrating improved CATE estimation in post-treatment covariate settings with realistic data perturbations.

Abstract

Treatment effect estimation involves assessing the impact of different treatments on individual outcomes. Current methods estimate Conditional Average Treatment Effect (CATE) using observational datasets where covariates are collected before treatment assignment and outcomes are observed afterward, under assumptions like positivity and unconfoundedness. In this paper, we address a scenario where both covariates and outcomes are gathered after treatment. We show that post-treatment covariates render CATE unidentifiable, and recovering CATE requires learning treatment-independent causal representations. Prior work shows that such representations can be learned through contrastive learning if counterfactual supervision is available in observational data. However, since counterfactuals are rare, other works have explored using simulators that offer synthetic counterfactual supervision. Our goal in this paper is to systematically analyze the role of simulators in estimating CATE. We analyze the CATE error of several baselines and highlight their limitations. We then establish a generalization bound that characterizes the CATE error from jointly training on real and simulated distributions, as a function of the real-simulator mismatch. Finally, we introduce SimPONet, a novel method whose loss function is inspired by our generalization bound. We further show how SimPONet adjusts the simulator's influence on the learning objective based on the simulator's relevance to the CATE task. We experiment with various data-generating processes (DGPs), systematically varying the real-simulator distribution gap to evaluate SimPONet's efficacy against state-of-the-art CATE baselines.

Paper Structure

This paper contains 38 sections, 5 theorems, 24 equations, 5 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

The Conditional Average Treatment Effect (CATE) of $T$ on $Y$, given $X$, is not identifiable using i.i.d. samples of the observed variables from the true data-generating process depicted in the top panel of Fig. 1.
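
To make the non-identifiability claim concrete, the following is a minimal toy sketch, not the paper's proof or its data-generating process. It assumes a hypothetical setup with a latent pre-treatment variable $Z \sim \mathcal{N}(0, 1)$, a randomized binary treatment $T$, a post-treatment covariate $X = g_T(Z)$, and an outcome $Y$ that depends on $(Z, T)$. The function name `two_worlds` and both "worlds" are illustrative assumptions. The two worlds produce exactly the same observed triples $(X, T, Y)$, yet their true treatment effects disagree on every control unit, so no estimator that sees only $(X, T, Y)$ can tell them apart.

```python
import numpy as np


def two_worlds(n=10_000, seed=0):
    """Build two hypothetical DGPs with identical observed data but different effects."""
    rng = np.random.default_rng(seed)
    t = rng.integers(0, 2, size=n)      # randomized binary treatment
    z_a = rng.standard_normal(n)        # latent covariate in world A

    # World A: X = Z for both arms, Y(t) = t * Z, so the effect is tau(z) = z.
    x_a = z_a
    y_a = t * z_a
    tau_a = z_a

    # World B: latent draw coupled to world A. Because N(0, 1) is symmetric and
    # z_a is independent of t, z_b is also N(0, 1) and independent of t.
    z_b = np.where(t == 1, -z_a, z_a)
    # In world B the treatment flips the covariate (X = -Z when treated),
    # and Y(t) = -t * Z, so the effect is tau(z) = -z.
    x_b = np.where(t == 1, -z_b, z_b)
    y_b = -t * z_b
    tau_b = -z_b

    return t, (x_a, y_a, tau_a), (x_b, y_b, tau_b)


if __name__ == "__main__":
    t, (x_a, y_a, tau_a), (x_b, y_b, tau_b) = two_worlds()
    control = t == 0
    print("same observed X:", np.array_equal(x_a, x_b))
    print("same observed Y:", np.array_equal(y_a, y_b))
    print("mean |tau_A - tau_B| on controls:",
          np.abs(tau_a[control] - tau_b[control]).mean())
```

In this sketch the ambiguity comes entirely from not knowing how the treatment transformed $Z$ into $X$, which is in line with the abstract's point that recovering CATE requires a treatment-independent representation.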

Figures (5)

  • Figure 1: The data-generating processes for the real and simulator distributions.
  • Figure 2: Factual errors with $p$-values shown above bars. For IHDP, RealOnly consistently outperforms $\text{Real}_\mu\text{Sim}_f$.
  • Figure 3: Comparing CATE errors under pre-treatment covariates $Z$ and under post-treatment covariates $X$ generated by an MLP or a normalizing flow.
  • Figure 4: We vary $\gamma_\tau$, which controls the gap between the synthetic CATE, $\tau^S$, and the real CATE, $\tau$. Each dataset is represented by a distinct color, where the pale version of the color indicates SimOnly and the darker version denotes SimPONet. For ACIC-7 and ACIC-26, as $\gamma_\tau$ increases, the CATE error grows significantly. Therefore, we present these results as an inset figure in the top-right corner.
  • Figure 5: SimPONet's model architecture.

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5