Targeted Data Fusion for Causal Survival Analysis Under Distribution Shift
Yi Liu, Alexander W. Levis, Ke Zhu, Shu Yang, Peter B. Gilbert, Larry Han
TL;DR
This work tackles causal survival analysis across multiple data sources under distribution shift and privacy constraints. It proposes two methods: a semiparametric efficient estimator, valid under a common conditional outcome distribution (CCOD) assumption, for settings where individual-level data can be pooled across sites, and a privacy-preserving federated estimator that adaptively weights source contributions when such sharing is restricted. The CCOD estimator is uniformly consistent and asymptotically normal, and it uses cross-fitting and ensemble nuisance estimation to attain efficiency gains; the federated estimator is consistent and asymptotically normal, and its data-adaptive site weighting can yield smaller variance than target-only methods. Applied to multi-site HIV prevention trials, the methods support privacy-preserving inference for time-to-event outcomes and show how source similarity affects information borrowing, with practical implications for coordinating multi-site studies under regulatory constraints.
Abstract
Causal inference across multiple data sources offers a promising avenue to enhance the generalizability and replicability of scientific findings. However, data integration methods for time-to-event outcomes, common in biomedical research, are underdeveloped. Existing approaches focus on binary or continuous outcomes but fail to address the unique challenges of survival analysis, such as censoring and the integration of discrete and continuous time. To bridge this gap, we propose two novel methods for estimating target site-specific causal effects in multi-source settings. First, we develop a semiparametric efficient estimator for settings where individual-level data can be shared across sites. Second, we introduce a federated learning framework designed for privacy-constrained environments, which dynamically reweights source-specific contributions to account for discrepancies with the target population. Both methods leverage flexible, nonparametric machine learning models to improve robustness and efficiency. We illustrate the utility of our approaches through simulation studies and an application to multi-site randomized trials of monoclonal neutralizing antibodies for HIV-1 prevention, conducted among cisgender men and transgender persons in the United States, Brazil, Peru, and Switzerland, as well as among women in sub-Saharan Africa. Our findings underscore the potential of these methods to enable efficient, privacy-preserving causal inference for time-to-event outcomes under distribution shift.
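To make the idea of reweighting source-specific contributions concrete, the following minimal Python sketch combines site-level summary estimates without exchanging individual-level data. It is a simplified stand-in and not the weighting procedure developed in the paper: the function names, the inverse mean-squared-error heuristic, and all numerical inputs are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def adaptive_site_weights(
    target_est: float,
    target_var: float,
    source_ests: Sequence[float],
    source_vars: Sequence[float],
) -> List[float]:
    """Return normalized weights ordered as [target, source_1, ..., source_K].

    Heuristic (an assumption, not the paper's method): each site is scored by
    the inverse of variance plus squared gap to the target estimate, so that
    sources that disagree with the target are down-weighted.
    """
    # The target site is treated as unbiased, so its score is 1 / variance.
    scores = [1.0 / target_var]
    for est, var in zip(source_ests, source_vars):
        # Squared gap to the target estimate serves as a crude bias proxy.
        gap_sq = (est - target_est) ** 2
        scores.append(1.0 / (var + gap_sq))
    total = sum(scores)
    return [s / total for s in scores]


def combine(
    target_est: float,
    source_ests: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted average of the target and source estimates."""
    estimates = [target_est, *source_ests]
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical survival-probability estimates at a single time point.
# Only these summaries (estimate and variance) leave each site.
target_est, target_var = 0.80, 0.0040
source_ests = [0.81, 0.60]       # first source resembles the target, second does not
source_vars = [0.0020, 0.0020]

weights = adaptive_site_weights(target_est, target_var, source_ests, source_vars)
fused = combine(target_est, source_ests, weights)
print("weights:", [round(w, 3) for w in weights])
print("fused estimate:", round(fused, 3))
```

In this toy setup the source whose estimate is close to the target receives most of the weight, while the dissimilar source contributes little, which mirrors the qualitative behavior described above: information is borrowed mainly from sources that resemble the target population.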
