Efficient estimation and data fusion under general semiparametric restrictions on outcome mean functions

Harrison H. Li

TL;DR

The paper develops a general semiparametric efficiency framework for estimating causal estimands under restrictions on the outcome mean function $m(r)=\mathbb{E}[Y\mid R=r]$, enabling efficient data fusion of an RCT with observational data. By introducing the outcome-mean tangent space $\mathcal{S}_{\mathcal{M}}$ and its orthogonal complement, it provides a three-step method to obtain the efficient influence function (EIF) and hence the efficiency bound, and it constructs cross-fit one-step estimators that achieve this bound under suitable regularity conditions. It derives explicit EIFs and efficient estimators for data-fusion settings with outcome-mediated selection bias ($\mathcal{M}_5$) and linear confounding bias ($\mathcal{M}_4$), and demonstrates substantial finite-sample gains through simulations and a Tennessee STAR data example. The framework unifies many existing data-fusion approaches under a single theory and offers practical guidance for leveraging auxiliary observational data to sharpen causal inference, with clear conditions under which efficiency gains are expected or limited.
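To make the phrase "cross-fit one-step estimator" concrete, here is a minimal sketch of the generic technique. It is not the paper's efficient estimator under $\mathcal{M}_4$ or $\mathcal{M}_5$. It assumes a single RCT with a known treatment probability `e`, a linear working model for each arm's outcome mean, and the standard AIPW influence function for the average treatment effect. The function name `crossfit_aipw` and all of its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def crossfit_aipw(y, a, x, e=0.5, n_folds=2, seed=0):
    """Cross-fit one-step (AIPW) estimate of the ATE in an RCT.

    Assumptions (illustrative, not taken from the paper): the treatment
    probability e is known, and each arm's outcome mean is fit by OLS.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds           # random, balanced fold labels
    design = np.column_stack([np.ones(n), x])      # intercept + covariates
    phi = np.empty(n)                              # uncentered influence values
    for k in range(n_folds):
        train, test = folds != k, folds == k
        mu = {}
        for arm in (0, 1):
            idx = train & (a == arm)
            # Nuisance fit uses only the training folds (cross-fitting).
            beta, *_ = np.linalg.lstsq(design[idx], y[idx], rcond=None)
            mu[arm] = design[test] @ beta
        yt, at = y[test], a[test]
        # Plug-in contrast plus the inverse-probability-weighted residual correction.
        phi[test] = (mu[1] - mu[0]
                     + at * (yt - mu[1]) / e
                     - (1 - at) * (yt - mu[0]) / (1 - e))
    tau_hat = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)              # plug-in standard error
    return tau_hat, se

# Usage on simulated data with a true ATE of 2 (simulation design is hypothetical).
rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=(n, 2))
a = rng.binomial(1, 0.5, size=n)
y = 1.0 + x @ np.array([1.0, -1.0]) + 2.0 * a + rng.normal(size=n)
tau_hat, se = crossfit_aipw(y, a, x)
print(tau_hat, se)
```

The same template, fitting nuisances on held-out folds and then averaging an estimated influence function, is what a one-step estimator based on a different EIF would follow. Only the expression assigned to `phi[test]` and the nuisance fits would change.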

Abstract

We provide a novel characterization of semiparametric efficiency in a generic supervised learning setting where the outcome mean function -- defined as the conditional expectation of the outcome of interest given the other observed variables -- is restricted to lie in some known semiparametric function class. The primary motivation is causal inference where a researcher running a randomized controlled trial often has access to an auxiliary observational dataset that is confounded or otherwise biased for estimating causal effects. Prior work has imposed various bespoke assumptions on this bias in an attempt to improve precision via data fusion. We show how many of these assumptions can be formulated as restrictions on the outcome mean function in the concatenation of the experimental and observational datasets. Then our theory provides a unified framework to maximally leverage such restrictions for precision gain by constructing efficient estimators in all of these settings as well as in a wide range of others that future investigators might be interested in. For example, when the observational dataset is subject to outcome-mediated selection bias, we show our novel efficient estimator dominates an existing control variate approach both asymptotically and in numerical studies.

Paper Structure

This paper contains 29 sections, 16 theorems, 249 equations, 1 figure, and 4 tables.

Key Result

Theorem 1

Suppose a RAL estimator of a pathwise differentiable estimand $\tau \in \mathbb{R}^d$ in the model $\mathcal{P}_{\mathcal{M}}$ exists with influence function $\varphi_0=\varphi_0(\cdot;\tau^*,\eta^*)$. Further assume the semiparametric tangent space $\mathcal{T}_{\mathcal{M}}^d$ for the model $\math

Figures (1)

  • Figure 1: The mean squared error (MSE), variance, and squared bias of our efficient one-step estimator $\hat{\tau}_{\textnormal{eff}}^{(4)}$ for $\tau_{\textnormal{obs}}$ and the baseline AIPSW estimator $\hat{\tau}_{\textnormal{ba}}$ in the Tennessee STAR data example described in Section \ref{sec:tennessee_star}, as a function of the fraction $\xi$ of the entire RCT used.

Theorems & Definitions (37)

  • Example 1: Restricted moment model
  • Example 2: Mean-exchangeable controls, li_improving_2023
  • Example 3: Parametric confounding bias and CATE, yang2024datafusion
  • Example 4: Linear confounding bias, kallus_removing_2018
  • Example 5: Outcome-mediated selection bias, guo2022multi
  • Theorem 1: Theorem 4.3, tsiatis2006semiparametric
  • Definition 1
  • Theorem 2
  • proof
  • Proposition 1
  • ...and 27 more