Data-Adaptive Integration With Summary Data
Kosuke Morikawa, Sho Komukai, Satoshi Hattori
TL;DR
A generalized entropy-balancing integration strategy is developed that calibrates external moments to the internal covariate distribution, explicitly permitting a biased external sample, and is implemented in the R package daisy.
Abstract
Combining an internal individual-level study with readily available external summary statistics promises major efficiency gains at minimal additional cost, yet heterogeneity between sources can bias estimates for the internal target population. We develop a generalized entropy-balancing integration strategy that calibrates external moments to the internal covariate distribution, explicitly permitting a biased external sample. Our estimator of the internal-population mean is doubly robust: it remains consistent when either the outcome-regression model or the entropy-balancing model is correctly specified. When multiple balancing specifications are plausible, we introduce a data-adaptive selection rule. We also provide easy-to-compute, fully estimable diagnostics, based on the Mahalanobis distance and the Pearson chi-square divergence, that pinpoint when integration is guaranteed to strictly outperform the internal sample mean. The approach is implemented in the R package daisy. Simulations and an application to nationwide public-access defibrillation records in Japan demonstrate meaningful precision gains while maintaining bias control under distributional shift.
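
For readers unfamiliar with entropy balancing, the basic mechanism can be illustrated with a minimal sketch. This is not the generalized estimator of the paper and not the interface of the daisy package; it shows only the textbook one-covariate case, in which observations are given exponential-tilting weights so that their weighted covariate mean matches a single target moment. The function name `entropy_balance_weights`, the toy data, and the direction of the reweighting are assumptions chosen purely for illustration.

```python
import math


def _tilt(x, lam):
    """Normalized exponential-tilting weights, w_i proportional to exp(lam * x_i)."""
    shift = max(lam * xi for xi in x)  # subtract the max exponent for numerical stability
    raw = [math.exp(lam * xi - shift) for xi in x]
    total = sum(raw)
    return [r / total for r in raw]


def entropy_balance_weights(x, target_mean, tol=1e-10, max_iter=100):
    """Hypothetical helper: find weights whose weighted mean of x equals target_mean.

    Among all weights that sum to one and satisfy the moment constraint,
    exponential tilting gives the ones closest to uniform in Kullback-Leibler
    divergence. The tilting parameter lam is found by Newton's method.
    Assumes target_mean lies strictly inside the range of x.
    """
    lam = 0.0
    for _ in range(max_iter):
        w = _tilt(x, lam)
        mean = sum(wi * xi for wi, xi in zip(w, x))
        gap = mean - target_mean
        if abs(gap) < tol:
            return w, lam
        # The derivative of the weighted mean with respect to lam is the weighted variance.
        var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
        lam -= gap / var
    raise RuntimeError("Newton iteration did not converge")


# Toy data (made up for illustration): one covariate and one summary mean.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
w, lam = entropy_balance_weights(x, target_mean=2.5)
weighted_mean = sum(wi * xi for wi, xi in zip(w, x))
```

With the toy data above, the unweighted mean of `x` is 2.0, so the solver has to shift weight toward larger values of `x` to reach the target of 2.5. In practice several moments are balanced at once, which turns the scalar Newton step into a multivariate one.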
