Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation
Divyat Mahajan, Ioannis Mitliagkas, Brady Neal, Vasilis Syrgkanis
TL;DR
This paper tackles model selection for conditional average treatment effect (CATE) estimation, where the fundamental challenge is that counterfactual outcomes are never observed. It runs an extensive empirical benchmark of 34 surrogate metrics across 415 estimators on 75 realistic datasets, using AutoML to tune nuisance models and RealCause to generate counterfactuals. The authors introduce a two-level model selection strategy and causal ensembling, and propose novel surrogate criteria built from adaptive propensity clipping, targeted learning, calibration, and Qini concepts. Key findings: doubly robust and TMLE-based surrogates dominate, plug-in T-Learner-based metrics remain highly competitive when nuisance models are well tuned, and two-level selection with ensembling yields meaningful gains in performance and robustness.
Abstract
We study the problem of model selection in causal inference, specifically for conditional average treatment effect (CATE) estimation. Unlike in standard machine learning, there is no perfect analogue of cross-validation for model selection, as we do not observe the counterfactual potential outcomes. To address this, a variety of surrogate metrics that use only observed data have been proposed for CATE model selection. However, their effectiveness is not well understood because prior studies offer only limited comparisons. We conduct an extensive empirical analysis to benchmark the surrogate model selection metrics introduced in the literature, as well as the novel ones introduced in this work. We ensure a fair comparison by tuning the hyperparameters associated with these metrics via AutoML, and provide more detailed trends by incorporating realistic datasets via generative modeling. Our analysis suggests novel model selection strategies based on careful hyperparameter selection of CATE estimators and causal ensembling.
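To make the idea of a surrogate metric concrete, the following is a minimal sketch of one doubly robust (AIPW-style) score computed from observed data only. It is an illustration under stated assumptions, not the paper's implementation: the function names `dr_pseudo_outcome` and `dr_score`, the synthetic data-generating process, and the two candidate CATE models are all hypothetical, and the nuisance functions (outcome models `mu0`, `mu1` and propensity `e`) are taken as known here, whereas in practice they would be estimated and tuned on held-out data.

```python
import numpy as np

def dr_pseudo_outcome(y, t, mu0, mu1, e):
    """Doubly robust (AIPW) pseudo-outcome whose conditional mean is the CATE."""
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)

def dr_score(tau_hat, y, t, mu0, mu1, e):
    """Mean squared gap between a candidate CATE and the pseudo-outcome (lower is better)."""
    return float(np.mean((dr_pseudo_outcome(y, t, mu0, mu1, e) - tau_hat) ** 2))

# Hypothetical synthetic validation set; the true CATE is 1 + x.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
e = np.clip(1.0 / (1.0 + np.exp(-x)), 0.1, 0.9)  # propensity P(T=1 | x), clipped away from 0 and 1
t = rng.binomial(1, e)                           # observed treatment assignment
mu0 = x                                          # E[Y | T=0, x], assumed known for this sketch
mu1 = x + (1.0 + x)                              # E[Y | T=1, x], assumed known for this sketch
y = np.where(t == 1, mu1, mu0) + 0.5 * rng.normal(size=n)  # only the factual outcome is observed

# Two hypothetical candidate CATE estimators evaluated on the validation set.
candidates = {"linear": 1.0 + x, "zero": np.zeros(n)}
scores = {name: dr_score(tau_hat, y, t, mu0, mu1, e) for name, tau_hat in candidates.items()}
best = min(scores, key=scores.get)  # model selection: pick the lowest surrogate score
```

The score never touches the unobserved counterfactual: it compares each candidate against a pseudo-outcome built from the factual outcome, the treatment indicator, and the nuisance functions. The other surrogate families mentioned above differ in how this pseudo-outcome or comparison target is constructed.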
