Efficient estimation and data fusion under general semiparametric restrictions on outcome mean functions
Harrison H. Li
TL;DR
The paper develops a general semiparametric efficiency framework for estimating causal estimands under restrictions on the outcome mean function $m(r)=\mathbb{E}[Y\mid R=r]$, enabling efficient data fusion of a randomized controlled trial (RCT) with observational data. By introducing the outcome-mean tangent space $\mathcal{S}_{\mathcal{M}}$ and its orthogonal complement, it gives a three-step method for obtaining the efficient influence function (EIF), and hence the efficiency bound, and it constructs cross-fit one-step estimators that attain this bound under suitable regularity conditions. It derives explicit EIFs and efficient estimators for data-fusion settings with outcome-mediated selection bias ($\mathcal{M}_5$) and linear confounding bias ($\mathcal{M}_4$), and demonstrates substantial finite-sample gains in simulations and in a Tennessee STAR data example. The framework unifies many existing data-fusion approaches under a single theory and offers practical guidance on using auxiliary observational data to sharpen causal inference, with clear conditions under which efficiency gains are expected or limited.
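To make the phrase "cross-fit one-step estimator" concrete, here is a minimal sketch of the generic construction in the simplest case: an RCT alone, with a known treatment probability and no restriction on the outcome mean function. It is not the paper's estimator for $\mathcal{M}_4$ or $\mathcal{M}_5$, and it uses no observational data. The function names, the linear working model for the outcome means, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept (illustrative working model)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta


def predict_linear(beta, X):
    Xd = np.column_stack([np.ones(len(X)), X])
    return Xd @ beta


def crossfit_one_step_ate(X, Z, Y, e=0.5, n_folds=5, seed=0):
    """Cross-fit one-step (AIPW) estimate of the average treatment effect.

    Assumes a randomized trial with known treatment probability `e`.
    Nuisance outcome means are fit on the other folds and evaluated on
    the held-out fold, so no observation is scored by a model that saw it.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    phi = np.empty(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        # Fit arm-specific outcome mean models on the training folds only.
        b1 = fit_linear(X[train & (Z == 1)], Y[train & (Z == 1)])
        b0 = fit_linear(X[train & (Z == 0)], Y[train & (Z == 0)])
        m1 = predict_linear(b1, X[test])
        m0 = predict_linear(b0, X[test])
        z, y = Z[test], Y[test]
        # Plug-in contrast plus the influence-function correction term.
        phi[test] = m1 - m0 + z * (y - m1) / e - (1 - z) * (y - m0) / (1 - e)
    est = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)  # plug-in standard error
    return est, se


def simulate(n=4000, tau=2.0, seed=1):
    """Hypothetical RCT data with a constant treatment effect `tau`."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    Z = rng.binomial(1, 0.5, size=n)
    Y = 1.0 + X @ np.array([1.0, -0.5]) + tau * Z + rng.normal(size=n)
    return X, Z, Y


if __name__ == "__main__":
    X, Z, Y = simulate()
    est, se = crossfit_one_step_ate(X, Z, Y)
    print(f"estimate = {est:.3f}, standard error = {se:.3f}")
```

The same template (fit nuisances out of fold, average the plug-in plus an influence-function correction) is what a one-step estimator looks like in general. In the restricted models the paper studies, the correction term would instead be built from the EIF derived for that model.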
Abstract
We provide a novel characterization of semiparametric efficiency in a generic supervised learning setting where the outcome mean function -- defined as the conditional expectation of the outcome of interest given the other observed variables -- is restricted to lie in some known semiparametric function class. The primary motivation is causal inference where a researcher running a randomized controlled trial often has access to an auxiliary observational dataset that is confounded or otherwise biased for estimating causal effects. Prior work has imposed various bespoke assumptions on this bias in an attempt to improve precision via data fusion. We show how many of these assumptions can be formulated as restrictions on the outcome mean function in the concatenation of the experimental and observational datasets. Then our theory provides a unified framework to maximally leverage such restrictions for precision gain by constructing efficient estimators in all of these settings as well as in a wide range of others that future investigators might be interested in. For example, when the observational dataset is subject to outcome-mediated selection bias, we show our novel efficient estimator dominates an existing control variate approach both asymptotically and in numerical studies.
