Continuous Treatment Effects with Surrogate Outcomes
Zhenghao Zeng, David Arbour, Avi Feller, Raghavendra Addanki, Ryan Rossi, Ritwik Sinha, Edward H. Kennedy
TL;DR
This paper tackles the problem of estimating continuous treatment effects when primary outcomes are partly missing, leveraging surrogate outcomes and unlabeled data in a semi-supervised, doubly robust framework. It derives an identifying characterisation and constructs a pseudo-outcome-based estimator that remains consistent if either the outcome model or the treatment-density model is correctly specified, and that is asymptotically normal under nonparametric smoothing. The authors prove oracle efficiency under mild bias conditions and quantify the variance reduction from incorporating surrogates and unlabeled data, supported by simulations and a Job Corps real-data application that reveals nonlinear dose-response behavior. The approach enables robust inference with flexible nuisance estimation (including machine learning) and applies broadly to dose-response estimation with missing primary outcomes. The practical impact lies in a more efficient and principled use of surrogate information to recover causal dose-response relationships in settings with costly or incomplete outcomes.
Abstract
In many real-world causal inference applications, the primary outcomes (labels) are partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish the asymptotic normality of the proposed estimator and show possible improvements in variance compared with methods that solely use labeled data. Extensive simulations show our methods enjoy appealing empirical performance.
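
The selection bias mentioned above can be seen in a toy simulation. The sketch below is not the paper's estimator: it has no treatment and no surrogate, and it only compares a complete-case mean with an inverse-probability-weighted mean when labeling depends on a covariate. The data-generating model, the variable names, and the assumption that the labeling probability `p_obs` is known exactly are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process (an assumption for illustration).
x = rng.normal(size=n)                   # covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)   # primary outcome, true mean is 1.0

# Labeling depends on x, so outcomes are not missing completely at random.
p_obs = 1.0 / (1.0 + np.exp(-x))         # probability that y is observed
r = rng.random(n) < p_obs                # r is True where y is labeled

# Complete-case analysis: average the labeled outcomes only.
complete_case_mean = y[r].mean()

# Inverse-probability weighting (normalized weights), using the
# labeling probability, which is assumed known in this toy setting.
w = r / p_obs
ipw_mean = np.sum(w * y) / np.sum(w)

print("complete-case mean:", complete_case_mean)
print("IPW mean:", ipw_mean)
```

Units with large `x` are labeled more often and also have larger outcomes, so the complete-case mean overshoots the true mean of 1.0, while the weighted mean stays close to it. In practice the labeling probability would have to be estimated, which is one motivation for doubly robust constructions.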
