Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data

Ellen Graham; Marco Carone; Andrea Rotnitzky

TL;DR

The paper develops a unified semiparametric framework for inference from multiple data sources that are fused by aligning conditional and marginal components of their distributions with those of the target distribution. It introduces a score-operator approach that connects ideal-data influence functions with observed-data influence functions, and it derives the efficient influence function for pathwise differentiable parameters in broad fused-data settings. The theory yields universal templates and, in many cases, closed-form efficient influence functions, enabling machine-learning debiased estimation in scenarios such as two-sample instrumental variable analysis, measurement error with validation studies, and studies that combine different epidemiological designs. The results also illuminate the trade-offs between alignment assumptions and efficiency, and pave the way for sensitivity analyses and extensions to mixed data types and designs.

Abstract

We address the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this paper, we extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, we provide universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation.

Paper Structure (51 sections, 8 theorems, 375 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Review of semiparametric theory
The inferential problem and fused-data framework
Alignment assumptions and the fused-data model definition
Main results
The score operator
Characterizing observed data influence functions
The observed data efficient influence function
Examples revisited
Discussion
Supplementary Material
Glossary of notation
Proofs of main text results
Decomposing ideal data influence functions
Proofs for Supplement C
...and 36 more sections

Key Result

Theorem 1

Given a fused-data model $\left( \mathcal{Q},\mathcal{P},\mathcal{C}\right)$ with respect to $\left( Q_{0},P_{0}\right)$, if the identification assumption holds, then $\psi \left( Q\right) =\varphi \left( P\right)$ for any $Q\in \mathcal{Q}$ and $P\in \mathcal{P}$ such that $P \overset{\mathcal{C}}{\approx} Q$.
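As a rough illustration of what an identity of the form $\psi(Q) = \varphi(P)$ can look like, consider a simplified version of the setting named in Example 1. The sketch below assumes (this is not taken from the paper) that a main source observes only a misclassified binary disease indicator, and that an external validation source aligns with the target distribution on the misclassification probabilities, summarized here as sensitivity and specificity. Under those assumptions the true prevalence can be written as a function of observed-data quantities through the standard Rogan-Gladen correction. The function names and numeric values are hypothetical, and this is a population-level identity, not the efficient estimator developed in the paper.

```python
def observed_positive_rate(prevalence, sensitivity, specificity):
    """Probability of a positive (possibly misclassified) test result.

    Law of total probability over the true disease status:
    P(Y* = 1) = se * pi + (1 - sp) * (1 - pi).
    """
    return sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)


def prevalence_from_observed(p_obs_positive, sensitivity, specificity):
    """Recover the true prevalence from observed-data quantities.

    Inverts observed_positive_rate (Rogan-Gladen correction). Assumes the
    sensitivity and specificity from the validation source also hold in
    the main source, which plays the role of an alignment assumption here.
    """
    denom = sensitivity + specificity - 1.0
    if denom == 0.0:
        # The test carries no information about disease status, so the
        # prevalence is not identified from the observed data.
        raise ValueError("sensitivity + specificity must differ from 1")
    return (p_obs_positive + specificity - 1.0) / denom


# Hypothetical values chosen only to check that the two maps are inverses.
true_prevalence = 0.2
se, sp = 0.9, 0.95
p_obs = observed_positive_rate(true_prevalence, se, sp)
recovered = prevalence_from_observed(p_obs, se, sp)
```

Here `recovered` agrees with `true_prevalence` up to floating-point error, which mirrors the statement that the target parameter and the observed-data parameter coincide whenever the two distributions are aligned.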

Figures (3)

Figure 1: Illustration of a fused-data model
Figure 2: Asymptotic relative efficiency of efficient estimators of the ATE under the scenarios (ii), (iii), and (iv) of Example 3
Figure S1: Asymptotic relative efficiency of efficient estimators of the ATE under the scenarios (ii), (iii), and (iv) of Example 3

Theorems & Definitions (40)

Example 1: Estimating disease prevalence from misclassified disease and an external validation study
Example 2: Two-Sample instrumental variables under a linear structural equation model
Example 3: Transporting average treatment effects
Definition 1
Definition 2
Definition 3
Theorem 1
Definition 4
Lemma 1
Definition 5
...and 30 more
