
Data-Adaptive Integration With Summary Data

Kosuke Morikawa, Sho Komukai, Satoshi Hattori

TL;DR

A generalized entropy-balancing integration strategy is developed that calibrates external moments to the internal covariate distribution, explicitly permitting a biased external sample, and is implemented in the R package daisy.

Abstract

Combining an internal individual-level study with readily available external summary statistics promises major efficiency gains at minimal additional cost, yet heterogeneity between sources can bias estimates for the internal target population. We develop a generalized entropy-balancing integration strategy that calibrates external moments to the internal covariate distribution, explicitly permitting a biased external sample. Our estimator of the internal-population mean is doubly robust: it remains consistent when either the outcome-regression model or the entropy-balancing model is correctly specified. When multiple balancing specifications are plausible, we introduce a data-adaptive selection rule. We also provide easy-to-compute, fully estimable diagnostics, based on the Mahalanobis distance and the Pearson chi-square divergence, that pinpoint when integration is guaranteed to strictly outperform the internal sample mean. The approach is implemented in the R package daisy. Simulations and an application to nationwide public-access defibrillation records in Japan demonstrate meaningful precision gains while maintaining bias control under distributional shift.
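
To make the idea of moment calibration concrete, here is a minimal sketch of the textbook KL (exponential-tilting) form of entropy balancing for a single scalar covariate. It is an illustration under stated assumptions, not the paper's generalized estimator and not the API of the daisy package: the function name `kl_balancing_weights`, the toy data, and the choice to reweight the individual-level observations toward one reported mean are all hypothetical. The weights take the form $w_i \propto \exp(\lambda x_i)$, and $\lambda$ is found by a safeguarded Newton iteration on the convex dual.

```python
import math


def kl_balancing_weights(x, target_mean, tol=1e-10, max_iter=200):
    """Weights w_i proportional to exp(lam * x_i), summing to one,
    whose weighted mean of x equals target_mean (KL entropy balancing)."""
    if not min(x) < target_mean < max(x):
        raise ValueError("target_mean must lie strictly inside the range of x")

    def tilt(lam):
        # Subtract the largest exponent before exponentiating for stability.
        z = [lam * xi for xi in x]
        m = max(z)
        e = [math.exp(zi - m) for zi in z]
        s = sum(e)
        w = [ei / s for ei in e]
        mean = sum(wi * xi for wi, xi in zip(w, x))
        return w, mean

    lam = 0.0
    w, mean = tilt(lam)
    for _ in range(max_iter):
        gap = mean - target_mean
        if abs(gap) < tol:
            break
        # The weighted variance is the second derivative of the dual objective.
        var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x))
        step = gap / var
        # Halve the Newton step until the moment gap actually shrinks.
        while True:
            w_new, mean_new = tilt(lam - step)
            if abs(mean_new - target_mean) < abs(gap):
                break
            step /= 2.0
        lam, w, mean = lam - step, w_new, mean_new
    return w


# Hypothetical toy data: five individual-level covariate values and outcomes,
# plus one mean (2.5) standing in for a reported summary statistic.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.5, 2.0, 3.5, 4.0]
weights = kl_balancing_weights(x, target_mean=2.5)
weighted_x_mean = sum(wi * xi for wi, xi in zip(weights, x))
weighted_y_mean = sum(wi * yi for wi, yi in zip(weights, y))
```

After the call, `weighted_x_mean` matches the target up to the tolerance, and `weighted_y_mean` is the correspondingly reweighted outcome mean. The double robustness and the data-adaptive selection described in the abstract require the paper's full construction and are not reproduced by this sketch.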

Paper Structure

This paper contains 20 sections, 9 theorems, 77 equations, 5 figures, 2 tables.

Key Result

Proposition 1

Under Conditions (C1), (C2), (C4), and either (C3) or (C3)$'$, a unique solution to equations est_eq1 and est_eq2 for $\hat{\theta}_{\mathrm{EBW}}$ exists with probability tending to one as $n\to\infty$.

Figures (5)

  • Figure 1: Conceptual diagram for the construction of the proposed estimator $\hat{\theta}_{\mathrm{EB}}$. The dashed lines indicate the summary statistics used in the entropy-balancing constraints, while the solid lines show how $\{\tilde{\mathcal{H}}_i\}$ and $\hat{\theta}_{\mathrm{EB}}$ are constructed from the estimated weights.
  • Figure 2: Boxplots of Monte Carlo estimates of the mean outcome $\mathrm{E}(Y)$ across simulation settings. Each panel corresponds to one external $X_1$ distribution (top row: $a=0$; bottom row: $a=1$). The red solid line represents the true mean outcome.
  • Figure 3: Empirical coverage probabilities of bootstrap 95% confidence intervals (CI) for the proposed methods (EB(KL), EB(S1EF), EB(Select), EBw(KL), EBw(S1EF), EBw(Select)) and 95% CI based on the Wald-type variance for the sample mean (SM). Panel (A) uses a bootstrap procedure that accounts for the sampling variability in the external summary information, whereas panel (B) treats the external summaries as fixed and ignores their variability. The red solid horizontal line marks the nominal level 0.95.
  • Figure 4: Auxiliary results for the numerical study. Panels (A) and (B) report barplots of the counts (out of 1,000 Monte Carlo iterations) of the selected first-step divergence family for EB and EBw, respectively (KL, LW, QLS, TS). Panel (C) shows boxplots of the selection criterion $D_1$ across settings; the red solid line indicates the EB threshold $1$ and the blue solid line indicates the EBw threshold $\sqrt{\log 2}$. When $D_1$ falls below the threshold, the selection-based procedure uses the proposed estimator; otherwise it reverts to SM.
  • Figure 5: Forest plot summarizing the real-data analyses. Each panel compares the internal-only estimator with naive pooling and the proposed EB/EBw methods when incorporating external data (2019 or 2010--2015, $n=1,000$). Horizontal bars represent 95% confidence intervals, and the dashed line indicates the 2019 registry mean (0.407).
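
The fallback rule described in the Figure 4 caption can be sketched as a small decision function. This is a hedged illustration only: the thresholds $1$ (EB) and $\sqrt{\log 2}$ (EBw) are taken from the caption, the logarithm is assumed to be natural, and the computation of $D_1$ itself is not given in this summary, so it is treated as an input. The names `THRESHOLDS` and `choose_estimator` are hypothetical and are not part of the daisy package.

```python
import math

# Thresholds quoted in the Figure 4 caption; natural logarithm assumed.
THRESHOLDS = {
    "EB": 1.0,
    "EBw": math.sqrt(math.log(2.0)),
}


def choose_estimator(d1, family):
    """Return the proposed estimator's label when d1 falls below the
    threshold for that family; otherwise fall back to the sample mean."""
    if family not in THRESHOLDS:
        raise ValueError(f"unknown family: {family!r}")
    return family if d1 < THRESHOLDS[family] else "SM"


# The same diagnostic value can lead to different decisions, because the
# EBw threshold (about 0.83) is stricter than the EB threshold (1).
decision_eb = choose_estimator(0.9, "EB")
decision_ebw = choose_estimator(0.9, "EBw")
```

Here `decision_eb` is `"EB"` while `decision_ebw` is `"SM"`: a value of $D_1 = 0.9$ is below the EB threshold but above the EBw threshold.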

Theorems & Definitions (14)

  • Remark
  • Example 2.1: Historical Controls in Randomized Controlled Trials
  • Example 2.2: Integration of Probability and Non-Probability Sampling
  • Example 3.1: Choice of Entropy Function $G_1$
  • Proposition 1
  • Theorem 1
  • proof
  • Theorem 2
  • Proposition 2
  • Theorem 3
  • ...and 4 more