Can synthetic data reproduce real-world findings in epidemiology? A replication study using adversarial random forests

Jan Kapar, Kathrin Günther, Lori Ann Vallis, Klaus Berger, Nadine Binder, Hermann Brenner, Stefanie Castell, Beate Fischer, Volker Harth, Bernd Holleczek, Timm Intemann, Till Ittermann, André Karch, Thomas Keil, Lilian Krist, Berit Lange, Michael F. Leitzmann, Katharina Nimptsch, Nadia Obi, Iris Pigeot, Tobias Pischon, Tamara Schikowski, Börge Schmidt, Carsten Oliver Schmidt, Anja M. Sedlmair, Justine Tanoey, Harm Wienbergen, Andreas Wienke, Claudia Wigmann, Marvin N. Wright

Abstract

Synthetic data holds substantial potential to address practical challenges in epidemiology arising from restricted data access and privacy concerns. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to reflect statistical utility directly or to measure privacy risks sufficiently. Against this background, a critical, underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research while preserving privacy. We propose adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. To evaluate its performance, we replicated statistical analyses from six epidemiological publications covering blood pressure, anthropometry, myocardial infarction, accelerometry, loneliness, and diabetes, from the German National Cohort (NAKO Gesundheitsstudie), the Bremen STEMI Registry U45 Study, and the Guelph Family Health Study. We further assessed how dataset dimensionality and variable complexity affect the quality of synthetic data, and contextualized ARF's performance by comparison with commonly used tabular data synthesizers in terms of utility, privacy, generalisation, and runtime. Across all replicated studies, results on ARF-generated synthetic data consistently aligned with original findings. Even for datasets with relatively low sample size-to-dimensionality ratios, replication outcomes closely matched the original results across descriptive and inferential analyses. Reduced dimensionality and variable complexity further enhanced synthesis quality. Relative to other synthesizers, ARF demonstrated favourable utility, privacy preservation, and generalisation, along with superior computational efficiency.

Paper Structure

This paper contains 62 sections, 2 equations, 25 figures, 3 tables.

Figures (25)

  • Figure 1: Replication of Figure 2 in Schikowski et al. [schikowski2020blutdruckmessung]: differences of mean blood pressure values (in mmHg) using the mean of the first and second measurements or the second measurement only, by sex and age group (in years). Percentile-based 95% bootstrap confidence intervals are reported for original data results. Median and percentile-based 95% confidence intervals of synthetic data results are reported. avg., average; meas., measurement; MASD, mean absolute standardised difference; CIO, mean confidence interval overlap
  • Figure 2: Replication of Figure 5 in Fischer et al. [fischer2020anthropometrisch]: subcutaneous and visceral abdominal adipose tissue thickness by sex. Percentile-based 95% bootstrap confidence intervals are reported for original data results. Median and percentile-based 95% confidence intervals of synthetic data results are reported. To keep the distributions comparable despite the varying number of missing values in the synthetic data, the counts for the synthetic results were rescaled to match the total count of the original data. WD, Wasserstein distance
  • Figure 3: Replication of separate logistic regressions per variable, each adjusted for age, sex, country of birth, and years of school education, Table 2 in Wienbergen et al. [wienbergen2022infarction]: associations of lifestyle and metabolic factors, as well as family history of premature MI, with risk of early-onset MI. Median beta estimates and percentile-based 95% confidence intervals computed from the beta coefficient distribution across synthesis repetitions are reported for synthetic data. BMI, body mass index; MI, myocardial infarction; MASD, mean absolute standardised difference; CIO, mean confidence interval overlap
  • Figure 4: Replication of Figure 2 in Breau et al. [breau2022cutpoint]: calculated average valid wear time minutes per day spent in SED, LPA, and MVPA according to age-appropriate ActiGraph cutpoint sets by age group. Percentile-based 95% bootstrap confidence intervals are reported for original data results. Median and percentile-based 95% confidence intervals of synthetic data results are reported. SED, sedentary behaviour; LPA, light physical activity; MVPA, moderate to vigorous physical activity; VA, vertical axis; VM, vector magnitude; MASD, mean absolute standardised difference; CIO, mean confidence interval overlap
  • Figure 5: Replication of multivariable linear regression, Table 3 in Berger et al. [berger2021COVID], with general and task-specific synthesis: relationship between perceived loneliness and sociodemographic factors as well as symptoms of depression and anxiety among German National Cohort (NAKO Gesundheitsstudie) participants in May 2020. Median beta estimates and percentile-based 95% confidence intervals computed from the beta coefficient distribution across synthesis repetitions are reported for synthetic data. PHQ-9, nine-item Patient Health Questionnaire; GAD-7, Generalised Anxiety Disorder seven-item scale; MASD, mean absolute standardised difference; CIO, mean confidence interval overlap
  • ...and 20 more figures
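
The figure captions refer to percentile-based 95% bootstrap confidence intervals for original data results and to confidence interval overlap (CIO) between original and synthetic results. The sketch below illustrates one way such quantities could be computed. It is not taken from the paper: the function names `percentile_bootstrap_ci` and `ci_overlap` are hypothetical, the overlap formula is the commonly used symmetric interval-overlap measure (the average of the overlap length relative to each interval's length), and the paper's exact definition may differ. The data in the usage lines are toy values generated for illustration only.

```python
import numpy as np


def percentile_bootstrap_ci(values, stat=np.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic (illustrative helper)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Resample row indices with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_stats = np.apply_along_axis(stat, 1, values[idx])
    alpha = 1.0 - level
    lower, upper = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


def ci_overlap(ci_orig, ci_syn):
    """Symmetric interval overlap (assumed definition, not confirmed by the source).

    Averages the overlap length relative to each interval's own length.
    Equals 1 for identical intervals and becomes negative for disjoint ones.
    """
    lo_o, hi_o = ci_orig
    lo_s, hi_s = ci_syn
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


# Toy usage: simulated numbers standing in for an original and a synthetic sample.
rng = np.random.default_rng(1)
original = rng.normal(loc=120.0, scale=15.0, size=500)
synthetic = rng.normal(loc=121.0, scale=15.0, size=500)

ci_o = percentile_bootstrap_ci(original)
ci_s = percentile_bootstrap_ci(synthetic)
print("original CI:", ci_o)
print("synthetic CI:", ci_s)
print("overlap:", ci_overlap(ci_o, ci_s))
```

A mean CIO, as named in the captions, would then be the average of such overlaps across all estimates in a replicated analysis. Whether negative overlaps for disjoint intervals are kept or clipped to zero is a design choice this sketch leaves open.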