Targeted Data Fusion for Causal Survival Analysis Under Distribution Shift

Yi Liu, Alexander W. Levis, Ke Zhu, Shu Yang, Peter B. Gilbert, Larry Han

TL;DR

This work tackles causal survival analysis across multiple data sources under distribution shift and privacy constraints. It proposes two methods: a semiparametric efficient estimator under the common conditional outcome distribution (CCOD) assumption for settings where individual-level data can be pooled, and a privacy-preserving federated estimator that adaptively weights source contributions when sharing is restricted. The CCOD estimator achieves uniform consistency and asymptotic normality, using cross-fitting and ensemble nuisance estimation to gain efficiency; the federated estimator is consistent and asymptotically normal, with potentially smaller variance than target-only methods thanks to data-adaptive site weighting. Applied to multi-site HIV prevention trials, the methods deliver accurate, privacy-preserving inference for time-to-event outcomes and show how source similarity affects information borrowing, with practical implications for coordinating multi-site studies under regulatory constraints.

Abstract

Causal inference across multiple data sources offers a promising avenue to enhance the generalizability and replicability of scientific findings. However, data integration methods for time-to-event outcomes, common in biomedical research, are underdeveloped. Existing approaches focus on binary or continuous outcomes but fail to address the unique challenges of survival analysis, such as censoring and the integration of discrete and continuous time. To bridge this gap, we propose two novel methods for estimating target site-specific causal effects in multi-source settings. First, we develop a semiparametric efficient estimator for settings where individual-level data can be shared across sites. Second, we introduce a federated learning framework designed for privacy-constrained environments, which dynamically reweights source-specific contributions to account for discrepancies with the target population. Both methods leverage flexible, nonparametric machine learning models to improve robustness and efficiency. We illustrate the utility of our approaches through simulation studies and an application to multi-site randomized trials of monoclonal neutralizing antibodies for HIV-1 prevention, conducted among cisgender men and transgender persons in the United States, Brazil, Peru, and Switzerland, as well as among women in sub-Saharan Africa. Our findings underscore the potential of these methods to enable efficient, privacy-preserving causal inference for time-to-event outcomes under distribution shift.

Paper Structure

This paper contains 31 sections, 15 theorems, 121 equations, 11 figures, 1 table.

Key Result

Proposition 2.4

The nonparametric EIF for $\theta^0(t,a)$ given $t\in[0,\tau]$ and $a\in\{0,1\}$ is $\varphi^{*0}_{t,a}(\mathcal{O};S^0,G^0,\pi^0) = \varphi^{*0}_{t,a}(\mathcal{O};\mathbb{P}) = \varphi^0_{t,a}(\mathcal{O};\mathbb{P})-\theta^0(t,a)$, where

Figures (11)

  • Figure 1: DAG for data structures under different CCOD assumptions. When CCOD holds (Panel (a)), $R$ and $T$ are conditionally independent given treatment $A$ and covariates $\mathbf X$, consistent with the CCOD assumption. When CCOD is potentially violated, as highlighted by the orange dashed arrow in Panel (b), $R$ and $T$ are not necessarily conditionally independent.
  • Figure 2: Illustration of the Federated Algorithm. Each site has its underlying survival functions, which are first estimated in a site-specific manner. Then, for all $a\in\{0,1\}$ and $t\in[0,\tau]$, treatment- and time-specific federated weights for the target and source sites are derived, ultimately producing the weighted average survival functions (the purple step).
  • Figure 3: Simulation results evaluated at day 90, under $n_0=300$ and $n_k=600$, $k=1,2,3,4$.
  • Figure 4: Data analysis results (South Africa as the target site). Panel (A): Missing values early in IVW indicate large IVW weights. Relative efficiency, defined as the ratio of the estimated standard deviation to that of the TGT estimator, is presented for three time points: 148, 330, and 512 days. Panel (B): Time-specific federated weights and smoothed weights using locally weighted regression (Cleveland and Devlin, 1988).
  • Figure 5: True treatment-specific survival curves across different sites. Each curve is derived from a random sample of $n=10^4$ generated under the site's own DGP. Dashed lines represent the target site's survival curves for reference. Under covariate shift, the curves maintain similar shapes and trends, differing primarily in scale. In contrast, outcome shift leads to marked alterations in the shapes of the survival curves.
  • ...and 6 more figures
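
The final step in the Figure 2 caption, a weighted average of site-specific survival functions, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `federated_survival`, the toy numbers, and the requirement that weights be nonnegative and sum to one at each time point are all assumptions made here. The data-adaptive derivation of the weights is not shown, so they are supplied as inputs.

```python
import numpy as np


def federated_survival(site_surv, weights):
    """Weighted average of site-specific survival estimates for one treatment arm.

    site_surv : array of shape (K + 1, T); row 0 is the target site, rows 1..K
                are source sites, columns are points on a shared time grid.
    weights   : array of the same shape; assumed (hypothetically) nonnegative
                with each column summing to one.
    """
    site_surv = np.asarray(site_surv, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if site_surv.shape != weights.shape:
        raise ValueError("site_surv and weights must have the same shape")
    if np.any(weights < 0) or not np.allclose(weights.sum(axis=0), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1 at each time")
    # Time-specific convex combination across sites.
    return (weights * site_surv).sum(axis=0)


# Toy inputs (made up): one target site and two source sites, three time points.
site_surv = np.array([
    [1.0, 0.90, 0.80],  # target site
    [1.0, 0.92, 0.84],  # source site similar to the target
    [1.0, 0.70, 0.50],  # source site that differs from the target
])
weights = np.array([
    [0.5, 0.5, 0.6],
    [0.5, 0.4, 0.4],
    [0.0, 0.1, 0.0],  # the dissimilar source receives little or no weight
])
fed = federated_survival(site_surv, weights)
print(fed)
```

In this toy setup the dissimilar source site contributes almost nothing, which mirrors the idea that information borrowing depends on source similarity to the target.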

Theorems & Definitions (30)

  • Proposition 2.4: Westling et al. (2023)
  • Theorem 2.7
  • Theorem 2.8
  • Remark 2.9
  • Theorem 2.10
  • Remark 2.11
  • Theorem 2.12
  • proof
  • Lemma D.1
  • proof
  • ...and 20 more