Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation

Divyat Mahajan, Ioannis Mitliagkas, Brady Neal, Vasilis Syrgkanis

TL;DR

This paper tackles model selection for conditional average treatment effect (CATE) estimation, where the fundamental challenge is that counterfactual outcomes are never observed. It benchmarks 34 surrogate metrics across 415 estimators on 75 realistic datasets, using AutoML to tune nuisance models and RealCause to generate counterfactuals. The authors introduce a two-level model selection strategy and causal ensembling, and propose novel surrogate criteria built from adaptive propensity clipping, targeted learning, calibration, and Qini concepts. Key findings: doubly robust and TMLE-based surrogates dominate, plug-in T-Learner-based metrics remain highly competitive when nuisance models are well tuned, and two-level selection with ensembling yields meaningful gains in performance and robustness.

Abstract

We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation. Unlike machine learning, there is no perfect analogue of cross-validation for model selection as we do not observe the counterfactual potential outcomes. Towards this, a variety of surrogate metrics have been proposed for CATE model selection that use only observed data. However, we do not have a good understanding regarding their effectiveness due to limited comparisons in prior studies. We conduct an extensive empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work. We ensure a fair comparison by tuning the hyperparameters associated with these metrics via AutoML, and provide more detailed trends by incorporating realistic datasets via generative modeling. Our analysis suggests novel model selection strategies based on careful hyperparameter selection of CATE estimators and causal ensembling.
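
For reference, the quantities at stake can be written in standard potential-outcomes notation. The symbols $Y(1)$, $Y(0)$, $X$, and $\tau$ are conventional choices assumed here, not taken from this page; only $\hat{\tau}$, $M(\hat{\tau})$, and PEHE appear in the figure captions.

$$\tau(x) = \mathbb{E}\left[\,Y(1) - Y(0) \mid X = x\,\right], \qquad \mathrm{PEHE}(\hat{\tau}) = \mathbb{E}\left[\left(\hat{\tau}(X) - \tau(X)\right)^{2}\right]$$

Only one of $Y(1)$ and $Y(0)$ is observed for any unit, so $\tau(X)$ and hence PEHE cannot be computed from observed data. A surrogate metric $M(\hat{\tau})$ stands in for PEHE using observed quantities only. Some authors report the square root of this quantity instead.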

Paper Structure

This paper contains 44 sections, 47 equations, 3 figures, and 13 tables.

Figures (3)

  • Figure 1: The proposed framework for comparing surrogate model selection strategies $M(\hat{\tau})$. We first perform intra-meta-learner selection using meta-learner-based metrics, and then construct an ensemble over the optimal meta-learners using the input surrogate metric $M(\hat{\tau})$. RealCause lets us sample counterfactual data for realistic datasets, so we can benchmark each surrogate metric $M(\hat{\tau})$ by the PEHE of the ensemble it returns.
  • Figure 2: Illustrating the construction of indirect and direct meta-learners used in our empirical study. We use a large grid over different regression model classes and hyperparameters for choosing the CATE predictor in direct meta-learners, resulting in 103 different CATE estimators per direct meta-learner. This design choice improves on prior work, which considers only a few hyperparameter choices for direct meta-learners.
  • Figure 3: Illustrating the use of AutoML in selecting the nuisance parameters of surrogate metrics for CATE model selection. This is an important design choice in contrast to prior works which relied on small grid searches to infer the nuisance parameters associated with surrogate metrics, potentially resulting in biased estimates and affecting the model selection ability of surrogate metrics.
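
The pipeline described in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation, and it makes several simplifying assumptions: the data are synthetic with a known CATE (standing in for RealCause), the nuisance models are per-arm linear fits rather than AutoML-tuned models, the propensity is known, the same doubly robust score is used at both levels (the caption describes meta-learner-specific metrics at the first level), and the ensemble uses softmax weights with a hypothetical `temperature` parameter. All function names and the groups `learner_a` and `learner_b` are made up for illustration.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y on [1, x]; a stand-in for a tuned nuisance model."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def dr_pseudo_outcome(y, t, e, mu0, mu1):
    """Doubly robust (AIPW) pseudo-outcome; its conditional mean is the CATE
    when either the outcome models or the propensity is correct."""
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)

def dr_score(tau_hat, y_dr):
    """Surrogate metric: mean squared gap to the pseudo-outcome (lower is better)."""
    return float(np.mean((y_dr - tau_hat) ** 2))

def pehe(tau_hat, tau_true):
    """Mean squared error against the true CATE; only computable on simulated data."""
    return float(np.mean((tau_hat - tau_true) ** 2))

def two_level_select(groups, y_dr, temperature=1.0):
    """Level 1: pick the best candidate inside each meta-learner group.
    Level 2: softmax-weighted ensemble over the group winners."""
    chosen = {name: int(np.argmin([dr_score(p, y_dr) for p in preds]))
              for name, preds in groups.items()}
    names = list(groups)
    winners = [groups[n][chosen[n]] for n in names]
    scores = np.array([dr_score(w, y_dr) for w in winners])
    z = -(scores - scores.min()) / temperature  # shift for numerical stability
    weights = np.exp(z) / np.exp(z).sum()
    ensemble = sum(w * p for w, p in zip(weights, winners))
    return ensemble, dict(zip(names, weights)), chosen

# Synthetic randomized experiment with a known, heterogeneous effect.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
t = rng.integers(0, 2, size=n)
tau_true = 1.0 + 0.5 * x
y = x + t * tau_true + rng.normal(scale=0.5, size=n)
e = np.full(n, 0.5)  # known propensity (assumption of this sketch)

# Nuisance outcome models, fit separately on each treatment arm.
# A real pipeline would cross-fit these on held-out folds.
c0 = fit_linear(x[t == 0], y[t == 0])
c1 = fit_linear(x[t == 1], y[t == 1])
mu0 = c0[0] + c0[1] * x
mu1 = c1[0] + c1[1] * x
y_dr = dr_pseudo_outcome(y, t, e, mu0, mu1)

# Hypothetical candidate CATE predictions, grouped by meta-learner.
groups = {
    "learner_a": [1.1 + 0.5 * x, np.zeros(n)],
    "learner_b": [0.8 + 0.3 * x, np.full(n, 3.0)],
}

ensemble, weights, chosen = two_level_select(groups, y_dr)
print("chosen index per group:", chosen)
print("ensemble weights:", weights)
print("PEHE of ensemble:", pehe(ensemble, tau_true))
```

The surrogate score never touches `tau_true`; the true CATE enters only in `pehe`, which mirrors how a generative model with known counterfactuals is used purely for evaluation. Lowering `temperature` pushes the ensemble toward picking the single best-scoring winner.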