MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image Synthesis

Joseph Cho; Mrudang Mathur; Cyril Zakka; Dhamanpreet Kaur; Matthew Leipzig; Alex Dalal; Aravind Krishnan; Eubee Koo; Karen Wai; Cindy S. Zhao; Akshay Chaudhari; Matthew Duda; Ashley Choi; Ehsan Rahimy; Lyna Azzouz; Robyn Fong; Rohan Shad; William Hiesinger

TL;DR

MediSyn introduces a text-guided latent diffusion model trained on a large, public multicenter medical image-text corpus to synthesize images across 6 specialties and 10 image types, addressing data scarcity and privacy concerns. It achieves competitive fidelity and diversity relative to specialist generators, with expert physicians confirming realism and text alignment, and demonstrates that synthetic data can enhance classifier performance in data-limited settings. Importantly, MediSyn’s outputs are largely distinct from training data, supporting privacy goals while enabling broad utility for medical algorithm development. The work highlights the promise of generalist, text-guided diffusion models to accelerate medical AI research while reducing dependence on sensitive real-world data.

Abstract

Deep learning algorithms require extensive data to achieve robust performance. However, data availability is often restricted in the medical domain due to patient privacy concerns. Synthetic data presents a possible solution to these challenges. Recently, image generative models have found increasing use for medical applications but are often designed for singular medical specialties and imaging modalities, thus limiting their broader utility. To address this, we introduce MediSyn: a text-guided, latent diffusion model capable of generating synthetic images from 6 medical specialties and 10 image types. Through extensive experimentation, we first demonstrate that MediSyn quantitatively matches or surpasses the performance of specialist models. Second, we show that our synthetic images are realistic and exhibit strong alignment with their corresponding text prompts, as validated by a team of expert physicians. Third, we provide empirical evidence that our synthetic images are visually distinct from their corresponding real patient images. Finally, we demonstrate that in data-limited settings, classifiers trained solely on synthetic data or real data supplemented with synthetic data can outperform those trained solely on real data. Our findings highlight the immense potential of generalist image generative models to accelerate algorithmic research and development in medicine.

Paper Structure (23 sections, 4 equations, 5 figures, 2 tables)

Introduction
Results
MediSyn matches or surpasses the performance of specialist models
MediSyn generates realistic and text-aligned medical images
MediSyn does not reproduce images from our training data
Synthetic images generated by MediSyn maintain or improve classifier performance
Discussion
Methods
Dataset curation and pre-processing
Model architecture and training
Model Evaluations
Assessment of quantitative performance against specialist models
Assessments of realism and text-alignment
Assessment for training data reproduction
Assessment for algorithmic training using synthetic data
...and 8 more sections

Figures (5)

Figure 1: Overview of MediSyn framework. a. Training dataset: A large-scale corpus of medical image-text pairs from 6 medical specialties was curated from the public domain. b. Training procedure: A text-conditional U-Net is trained to denoise a latent space representation of an image intentionally corrupted with Gaussian noise. c. Inference procedure (text-to-image generation): The trained U-Net progressively denoises a latent vector sampled from a Gaussian distribution, which is then decoded by a variational autoencoder (VAE) into a high-quality synthetic medical image. MSE: mean squared error.
Figure 2: Text-conditioned synthesis of medical images. A series of synthetic images generated by MediSyn, covering 6 medical specialties and 10 image types. The accompanying captions served as the text prompts for our model.
Figure 3: Physician assessments of synthetic and real medical images. a. Surgeons were asked to select any synthetic image(s) in image pairs of laparoscopic cholecystectomy. b. Surgeon performance while identifying synthetic surgical images. c. Additionally, surgeons were asked to classify the surgical phase for a set of real and synthetic laparoscopic cholecystectomy images. d. Surgeon performance on the surgical image classification task. e. Ophthalmologists were asked to select any synthetic image(s) in image pairs of optical coherence tomography images. f. Ophthalmologist performance while identifying synthetic optical coherence tomography images. g. Additionally, ophthalmologists were asked to classify the disease condition for a set of real and synthetic optical coherence tomography images. h. Ophthalmologist performance on the optical coherence tomography image classification task. Experience is measured as time in current position. Metrics are annotated with an upward arrow to indicate that higher values reflect better physician performance.
Figure 4: Assessment of training data reproduction. a. For each synthetic image, we found its nearest neighbor in the embedding space of BiomedCLIP's vision encoder. b. For each synthetic-real image pair, we calculated a normalized Euclidean distance between the two images. c. Additional examples of synthetic images alongside their nearest neighbor from the training dataset.
Figure 5: Performance of classifiers trained on either real data, synthetic data, or real data supplemented with synthetic data. Macro-averaged test AUROC across five runs for ResNet-50 classifiers trained on varying proportions of the real data, the synthetic data generated by MediSyn, or the real data supplemented with the synthetic data. Classifiers were trained and evaluated on: a. chest X-ray images to classify diseases (multi-label classification), b. dermoscopy images to classify diseases (multi-class classification), and c. robot-assisted radical prostatectomy images to classify surgical actions (multi-class classification). All data presented as mean $\pm$ 1 standard deviation.
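
The nearest-neighbor check described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings below are toy 2-D vectors standing in for BiomedCLIP vision-encoder outputs, the function names are hypothetical, and "normalized Euclidean distance" is assumed here to mean the Euclidean distance between L2-normalized vectors (range 0 to 2). The caption does not specify the normalization actually used.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed normalization, see lead-in)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_training_neighbor(synthetic_emb, training_embs):
    """Return (index, distance) of the training embedding closest to the
    synthetic one, comparing L2-normalized vectors."""
    if not training_embs:
        raise ValueError("training_embs must not be empty")
    query = l2_normalize(synthetic_emb)
    best_index, best_distance = -1, float("inf")
    for i, emb in enumerate(training_embs):
        d = euclidean(query, l2_normalize(emb))
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, best_distance


# Toy stand-ins for encoder outputs (hypothetical values).
training = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
index, distance = nearest_training_neighbor([2.0, 2.1], training)
```

Under this assumed normalization, a distance near 0 would flag a synthetic image whose embedding almost coincides with that of a training image, while larger distances suggest the image is not a near copy.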
