Reliable Selection of Heterogeneous Treatment Effect Estimators

Jiayi Guo, Zijun Gao

TL;DR

This work addresses how to select the best heterogeneous treatment effect (HTE) estimator without access to ground-truth individual treatment effects (ITEs), treating estimator selection as argmin inference over multiple candidate estimators. It introduces a ground-truth-free procedure based on a cross-fitted, exponentially weighted statistic with a two-way sample-splitting scheme, and proves asymptotic control of the familywise error rate (FWER) via a stability-based central limit theorem. Empirically, the method reduces false selections across the ACIC 2016, IHDP, and Twins benchmarks, and remains effective as the number of candidates grows and when nuisance estimators are black-box models. The approach offers a principled, scalable way to compare HTE estimators on real-world data, with practical implications for model selection in personalized decision-making.

Abstract

We study the problem of selecting the best heterogeneous treatment effect (HTE) estimator from a collection of candidates in settings where the treatment effect is fundamentally unobserved. We cast estimator selection as a multiple testing problem and introduce a ground-truth-free procedure based on a cross-fitted, exponentially weighted test statistic. A key component of our method is a two-way sample splitting scheme that decouples nuisance estimation from weight learning and ensures the stability required for valid inference. Leveraging a stability-based central limit theorem, we establish asymptotic familywise error rate control under mild regularity conditions. Empirically, our procedure provides reliable error control while substantially reducing false selections compared with commonly used methods across ACIC 2016, IHDP, and Twins benchmarks, demonstrating that our method is feasible and powerful even without ground-truth treatment effects.

Paper Structure

This paper contains 28 sections, 6 theorems, 83 equations, 13 figures, 2 tables, 2 algorithms.

Key Result

Theorem 3.1

Under Assumptions a1–a3, the selection rule defined above satisfies

Figures (13)

  • Figure 1: Difference in false selections between the Naive Method and our proposed method (mean $\pm$ 95% bootstrap confidence interval) under a linear toy model with three competitive and four clearly inferior estimators. See Appendix \ref{app-toy} for details.
  • Figure 2: The dataset is first split horizontally into Fold A (top) and Fold B (bottom). Each half is then partitioned vertically into $K$ subfolds.
  • Figure 3: Comparison of FWER under different data-splitting schemes (mean and variability over 100 repetitions) in the same linear model. Careful cross-fitting ensures proper error control, while casual splits lead to inflated FWER. Details are provided in Appendix \ref{app-toy}.
  • Figure : (a) ACIC dataset
  • Figure : (a) ACIC2016 dataset
  • ...and 8 more figures
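
The index partition described in the Figure 2 caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `two_way_split`, the random shuffling, the equal-sized halves, and the strided assignment to subfolds are all assumptions. The caption does not say which fold serves nuisance estimation and which serves weight learning, so the sketch stops at producing the index sets.

```python
import random


def two_way_split(n, k, seed=0):
    """Partition indices 0..n-1 into Fold A and Fold B, each cut into k subfolds.

    Hypothetical helper: the shuffle, the half/half cut, and the strided
    subfold assignment are illustrative choices, not taken from the paper.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumed: random assignment of rows

    half = n // 2
    halves = {"A": indices[:half], "B": indices[half:]}  # the "horizontal" split

    # The "vertical" split: subfold j takes every k-th index starting at offset j.
    return {name: [idx[j::k] for j in range(k)] for name, idx in halves.items()}


folds = two_way_split(n=20, k=5)
for name, subfolds in folds.items():
    print(name, [len(s) for s in subfolds])
```

Every observation lands in exactly one subfold of exactly one fold, so any quantity fitted on one part of the data can be evaluated on rows it has not seen.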

Theorems & Definitions (12)

  • Theorem 3.1: FWER control of the naive max–statistic test
  • Theorem 3.2: FWER control of the cross-fitted exponentially weighted test
  • Definition 1
  • Theorem 3.3: Stability-based CLT for globally dependent data
  • Theorem 3.4: First Order Stability
  • Theorem 3.5: Second Order Stability
  • proof
  • Theorem D.1: First- and second-order stability under local smoothness
  • proof : Proof sketch
  • Remark 1: What the assumptions buy
  • ...and 2 more