Synthetically Enhanced: Unveiling Synthetic Data's Potential in Medical Imaging Research

Bardia Khosravi; Frank Li; Theo Dapamede; Pouria Rouzrokh; Cooper U. Gamble; Hari M. Trivedi; Cody C. Wyles; Andrew B. Sellergren; Saptarshi Purkayastha; Bradley J. Erickson; Judy W. Gichoya

TL;DR

This study shows that diffusion-based generative models can produce high-fidelity chest X-rays (CXRs) conditioned on demographics and pathologies, enabling synthetic data augmentation. Supplementing real data with synthetic CXRs improves AUROC by up to $0.02$ on internal and external test sets at $1000\%$ supplementation, and classifiers trained on purely synthetic data reach performance comparable to real-data training with about $200\%$ to $300\%$ supplementation. However, synthetic data alone generally lags behind mixed real and synthetic data, and potential leakage from training data into test distributions warrants careful experimental design. The work demonstrates practical benefits for cross-site generalization and offers guidance on classifier-free guidance (CFG) tuning, dataset size, and the computational tradeoffs of diffusion-based augmentation in medical imaging.

Abstract

Chest X-rays (CXRs) are essential for diagnosing a variety of conditions, but models trained on them often generalize poorly to new populations, which limits their efficacy. Generative AI, particularly denoising diffusion probabilistic models (DDPMs), offers a promising approach to generating synthetic images and enhancing dataset diversity. This study investigates the impact of synthetic data supplementation on the performance and generalizability of medical imaging models. The study employed DDPMs to create synthetic CXRs conditioned on demographic and pathological characteristics from the CheXpert dataset. These synthetic images were used to supplement training datasets for pathology classifiers, with the aim of improving their performance. The evaluation involved three datasets (CheXpert, MIMIC-CXR, and Emory Chest X-ray) and various experiments, including supplementing real data with synthetic data, training with purely synthetic data, and mixing synthetic data with external datasets. Performance was assessed using the area under the receiver operating characteristic curve (AUROC). Adding synthetic data to real datasets resulted in a notable increase in AUROC values (up to 0.02 in internal and external test sets with 1000% supplementation, p < 0.01 in all instances). When classifiers were trained exclusively on synthetic data, they achieved performance levels comparable to those trained on real data with 200%-300% data supplementation. The combination of real and synthetic data from different sources demonstrated enhanced model generalizability, increasing model AUROC from 0.76 to 0.80 on the internal test set (p < 0.01). In conclusion, synthetic data supplementation significantly improves the performance and generalizability of pathology classifiers in medical imaging.
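
Because every result above is reported as an AUROC, it may help to recall what that number measures: the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison in plain Python. It is an illustration only, not the authors' evaluation code; the function name `auroc` and the toy labels and scores are assumptions made for this example.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This is the pairwise (Mann-Whitney) form of the metric; it is O(n^2)
    and meant for illustration, not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = pathology present, 0 = absent.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly
```

Read this way, a gain of 0.02 in AUROC means that about two more positive/negative pairs per hundred are ordered correctly by the classifier.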

Paper Structure (17 sections, 2 equations, 7 figures, 5 tables)

Introduction
Methods
  Dataset Description
  Image Generation
  Pathology Classification
    Supplementing real data with synthetic data from the same origin
    Purely synthetic data
    Mixing synthetic data with an external dataset
  Evaluation
Results
  Study Population
  Synthetic Data Quality
  Classification Experiments
    Supplementing real data with synthetic data from the same origin
    Purely synthetic data
...and 2 more sections

Figures (7)

Figure 1: Overview of forward and reverse diffusion processes.
Figure 2: Examples of real images and of synthetic images obtained from the diffusion model using different seeds. The pathologies shown are those the model was conditioned on.
Figure 3: Performance evaluation of models trained on real data, real data supplemented by synthetic data, and synthetic data only on various test sets: (A) CheXpert Test, (B) MIMIC-CXR, and (C) Emory Chest X-ray. The red line in each graph marks the performance on the target dataset of the baseline classifier, which was trained only on real data from the CheXpert training set.
Figure 4: Performance evaluation of models trained on the MIMIC-CXR training set (MIMICTr) with and without supplementation with synthetic data from external sources on various datasets: (A) CheXpert Test, and (B) Emory Chest X-ray.
Figure E1: Normalized label co-occurrence matrix for pathologies in the CheXpert dataset. For each condition in row $r$ of the heatmap, column $c$ gives the fraction of all samples with condition $r$ that also have condition $c$.
...and 2 more figures
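
The row normalization described in the Figure E1 caption can be sketched as follows: entry $(r, c)$ is the number of samples that have both conditions divided by the number of samples that have condition $r$, so the matrix is generally not symmetric and its diagonal is 1 for any condition that occurs. This is a minimal sketch under stated assumptions, not the authors' code; the function name `normalized_cooccurrence`, the set-based sample representation, and the toy samples are all hypothetical.

```python
def normalized_cooccurrence(samples, conditions):
    """Row-normalized label co-occurrence.

    samples:    iterable of collections, each holding the conditions
                present in one image (hypothetical representation).
    conditions: ordered list of condition names for rows and columns.
    Returns a nested dict m where m[r][c] is the fraction of samples
    with condition r that also have condition c.
    """
    counts = {r: {c: 0 for c in conditions} for r in conditions}
    totals = {r: 0 for r in conditions}

    for present in samples:
        present = set(present)
        for r in conditions:
            if r not in present:
                continue
            totals[r] += 1
            for c in conditions:
                if c in present:
                    counts[r][c] += 1

    # Rows for conditions that never occur are left at 0.0 to avoid
    # dividing by zero.
    return {
        r: {c: (counts[r][c] / totals[r] if totals[r] else 0.0)
            for c in conditions}
        for r in conditions
    }


# Hypothetical toy labels, not taken from CheXpert.
toy_samples = [
    {"Edema", "Cardiomegaly"},
    {"Edema"},
    {"Cardiomegaly"},
    {"Edema", "Cardiomegaly"},
    {"Edema"},
]
matrix = normalized_cooccurrence(toy_samples, ["Edema", "Cardiomegaly"])
```

In the toy data, half of the Edema samples also show Cardiomegaly, while two thirds of the Cardiomegaly samples also show Edema, which illustrates why the row and column roles in the heatmap are not interchangeable.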
