Efficient estimation and data fusion under general semiparametric restrictions on outcome mean functions

Harrison H. Li

TL;DR

The paper develops a general semiparametric efficiency framework for estimating causal estimands under restrictions on the outcome mean function $m(r)=\mathbb{E}[Y\mid R=r]$, enabling efficient data fusion of an RCT with observational data. By introducing the outcome-mean tangent space $\mathcal{S}_{\mathcal{M}}$ and its orthogonal complement, it provides a three-step method to obtain the efficient influence function and hence the efficiency bound, and it constructs cross-fit one-step estimators that achieve this bound under suitable regularity. It derives explicit EIFs and efficient estimators for data-fusion settings with outcome-mediated selection bias ($\mathcal{M}_5$) and linear confounding bias ($\mathcal{M}_4$), and demonstrates substantial finite-sample gains through simulations and a Tennessee STAR data example. The framework unifies many existing data-fusion approaches under a single theory and offers practical guidance for leveraging auxiliary observational data to sharpen causal inference, with clear conditions under which efficiency gains are expected or limited.
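To make the phrase "cross-fit one-step estimator" concrete, here is a minimal sketch in the simplest setting: an RCT alone, with a known treatment probability, where the one-step estimator of the average treatment effect reduces to the textbook AIPW form. This is a generic illustration and not the paper's efficient estimator for $\mathcal{M}_4$ or $\mathcal{M}_5$. The function names (`fit_linear`, `crossfit_one_step_ate`), the linear outcome model, the two-fold split, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta

def crossfit_one_step_ate(X, Z, Y, e=0.5, n_folds=2, seed=0):
    """Cross-fit one-step (AIPW) estimate of the ATE with known propensity e.

    Hypothetical helper for illustration: nuisance outcome models are fit
    on the training folds and evaluated on the held-out fold only.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    phi = np.empty(n)                     # uncentered influence-function values
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Outcome mean models for each arm, fit without the held-out fold.
        m1 = fit_linear(X[train & (Z == 1)], Y[train & (Z == 1)])
        m0 = fit_linear(X[train & (Z == 0)], Y[train & (Z == 0)])
        mu1, mu0 = m1(X[test]), m0(X[test])
        z, y = Z[test], Y[test]
        # Plug-in contrast plus the inverse-probability-weighted residual correction.
        phi[test] = (mu1 - mu0
                     + z * (y - mu1) / e
                     - (1 - z) * (y - mu0) / (1 - e))
    tau_hat = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)     # plug-in standard error
    return tau_hat, se

# Simulated RCT (assumed data-generating process, true ATE = 2.0).
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 2))
Z = rng.binomial(1, 0.5, size=n)
Y = 1.0 + X @ np.array([1.0, -0.5]) + 2.0 * Z + rng.normal(size=n)
tau_hat, se = crossfit_one_step_ate(X, Z, Y)
print(f"estimate = {tau_hat:.3f}, standard error = {se:.3f}")
```

Cross-fitting means each observation's influence-function value is computed from nuisance fits that never saw that observation, which is what lets flexible nuisance estimators be used under relatively weak regularity conditions.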

Abstract

We provide a novel characterization of semiparametric efficiency in a generic supervised learning setting where the outcome mean function -- defined as the conditional expectation of the outcome of interest given the other observed variables -- is restricted to lie in some known semiparametric function class. The primary motivation is causal inference where a researcher running a randomized controlled trial often has access to an auxiliary observational dataset that is confounded or otherwise biased for estimating causal effects. Prior work has imposed various bespoke assumptions on this bias in an attempt to improve precision via data fusion. We show how many of these assumptions can be formulated as restrictions on the outcome mean function in the concatenation of the experimental and observational datasets. Then our theory provides a unified framework to maximally leverage such restrictions for precision gain by constructing efficient estimators in all of these settings as well as in a wide range of others that future investigators might be interested in. For example, when the observational dataset is subject to outcome-mediated selection bias, we show our novel efficient estimator dominates an existing control variate approach both asymptotically and in numerical studies.

Paper Structure (29 sections, 16 theorems, 249 equations, 1 figure, 4 tables)

Introduction
Examples
Semiparametric efficiency bounds
Initial influence function
The outcome mean function tangent space
Orthogonal complements and projections
Additional insights
Constructing one-step estimators
Some specific novel efficient estimators
Simulations
Data example
Discussion
Proofs
Preliminaries
Proof of Theorem \ref{thm:semiparametric_tangent_space}
...and 14 more sections

Key Result

Theorem 1

Suppose a RAL estimator of a pathwise differentiable estimand $\tau \in \mathbb{R}^d$ in the model $\mathcal{P}_{\mathcal{M}}$ exists with influence function $\varphi_0=\varphi_0(\cdot;\tau^*,\eta^*)$. Further assume the semiparametric tangent space $\mathcal{T}_{\mathcal{M}}^d$ for the model ...
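
For orientation, the theorem list attributes this result to Theorem 4.3 of Tsiatis (2006), which is usually stated as a projection. A sketch in generic notation follows. The symbols $\varphi_{\textnormal{eff}}$ and $\Pi$ are assumed here and do not come from the statement above, and the regularity conditions are omitted:

$$
\varphi_{\textnormal{eff}}
= \Pi\left(\varphi_0 \mid \mathcal{T}_{\mathcal{M}}^d\right)
= \varphi_0 - \Pi\left(\varphi_0 \mid (\mathcal{T}_{\mathcal{M}}^d)^{\perp}\right),
\qquad
\text{efficiency bound} = \mathbb{E}\left[\varphi_{\textnormal{eff}}\,\varphi_{\textnormal{eff}}^{\top}\right].
$$

Read this way, any initial influence function $\varphi_0$ can be turned into the efficient one by subtracting its component in the orthogonal complement of the tangent space.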

Figures (1)

Figure 1: The mean squared error (MSE), variance, and squared bias of our efficient one-step estimator $\hat{\tau}_{\textnormal{eff}}^{(4)}$ for $\tau_{\textnormal{obs}}$ and the baseline AIPSW estimator $\hat{\tau}_{\textnormal{ba}}$ in the Tennessee STAR data example described in Section \ref{sec:tennessee_star}, as a function of the fraction $\xi$ of the entire RCT used.

Theorems & Definitions (37)

Example 1: Restricted moment model
Example 2: Mean-exchangeable controls, li_improving_2023
Example 3: Parametric confounding bias and CATE, yang2024datafusion
Example 4: Linear confounding bias, kallus_removing_2018
Example 5: Outcome-mediated selection bias, guo2022multi
Theorem 1: Theorem 4.3, tsiatis2006semiparametric
Definition 1
Theorem 2
proof
Proposition 1
...and 27 more
