Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

Che Liu, Zhongwei Wan, Haozhe Wang, Yinda Chen, Talha Qaiser, Chen Jin, Fariba Yousefi, Nikolay Burlutskiy, Rossella Arcucci

TL;DR

This work probes whether Medical Vision-Language Pre-training (MedVLP) can succeed with purely synthetic data. It introduces SynCXR, a synthetic dataset of $200{,}000$ image-text pairs generated without real data, using LLMs and a CXR-specific text-to-image model, with quality and distribution controls. MedVLP models pretrained on SynCXR outperformed real-data baselines by $3.8\%$ in averaged AUC, and mixtures of synthetic and real data yielded an additional $9.07\%$ improvement, along with gains in zero-shot grounding and downstream tasks. The results indicate that well-designed synthetic data can overcome real-data noise and long-tailed distributions, offering practical benefits for scalable, privacy-preserving medical VLP.

Abstract

Medical Vision-Language Pre-training (MedVLP) has made significant progress in enabling zero-shot tasks for medical image understanding. However, training MedVLP models typically requires large-scale datasets with paired, high-quality image-text data, which are scarce in the medical domain. Recent advancements in Large Language Models (LLMs) and diffusion models have made it possible to generate large-scale synthetic image-text pairs. This raises the question: "Can MedVLP succeed using purely synthetic data?" To address this, we use off-the-shelf generative models to create synthetic radiology reports and paired Chest X-ray (CXR) images, and propose an automated pipeline to build a diverse, high-quality synthetic dataset. Holding the model and training settings fixed enables a rigorous study that focuses entirely on the data perspective. Our results show that MedVLP models trained exclusively on synthetic data outperform those trained on real data by 3.8% in averaged AUC on zero-shot classification. Moreover, using a combination of synthetic and real data leads to a further improvement of 9.07%. Additionally, MedVLP models trained on synthetic or mixed data consistently outperform those trained on real data in zero-shot grounding, as well as in fine-tuned classification and segmentation tasks. Our analysis suggests MedVLP trained on well-designed synthetic data can outperform models trained on real datasets, which may be limited by low-quality samples and long-tailed distributions.

Paper Structure

This paper contains 17 sections, 9 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Comparison of real image-text datasets and synthetic datasets. (a): The real image-text dataset, MIMIC-CXR [johnson2019mimicjpg], while authentic, often contains imperfections such as long-tailed data distribution, unpaired images and text, and low-quality CXR images, which limit the performance of MedVLP models pretrained on this dataset. (b): The synthetic dataset generation process uses clinical entities as prompts to an LLM (e.g., Llama3.1 [llama3modelcard]) to generate synthetic reports. These reports are then used to create synthetic images through RoentGen [bluethgen2024vision]. We propose an automated pipeline to control the dataset distribution, ensuring it is balanced and includes paired image-text samples.
  • Figure 2: (a): Examples of invalid or low-quality images filtered out by the proposed image curation method described in Sec. \ref{sec:clean img}. (b): The image curation pipeline uses InternVL2 [chen2023internvl], a Multimodal Large Language Model (MLLM), to assess CXR image quality. Images that meet the criteria are retained; others are discarded. (c): Entity frequency distribution in the MIMIC-CXR dataset. Due to space constraints, only the top 50 frequent entities for four categories (Abnormality, Non-Abnormality, Disease, Non-Disease) are shown. A more detailed distribution is presented in Figs. \ref{fig:abnoraml dist}, \ref{fig:nonabnoraml dist}, \ref{fig:anotomy dist}, \ref{fig:disease dist}, \ref{fig:nondisease dist}.
  • Figure 3: Effect of various factors on the SynCXR dataset. Top: Impact of entity usage ratio on MedVLP performance for the ConVIRT and GLoRIA methods. Bottom Left: Effect of different LLMs for report generation on both MedVLP methods. Bottom Right: Effect of different CXR image generation models on both MedVLP methods.
  • Figure 4: Distribution of Synthetic and Real Data. (a): Comparison of the first principal component distribution of features extracted from RAD-DINO for synthetic and real images. (b): Comparison of the first principal component distribution of features extracted from Med-CPT for synthetic and real reports.
  • Figure 5: Pipeline for generating synthetic reports. The process begins by generating the 'FINDINGS' section, followed by summarizing it into the 'IMPRESSION' section. Both sections are checked to ensure they contain the specified entities; if not, the generation process is repeated. The final dataset includes 200,000 synthetic reports, each containing both 'FINDINGS' and 'IMPRESSION' sections.
  • ...and 5 more figures
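
The generate-check-retry loop described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`generate_report`, `contains_all`), the case-insensitive substring check, and the `max_attempts` cap are all hypothetical, and the two generator callables stand in for LLM calls that the caption does not specify.

```python
from typing import Callable, Dict, Optional, Sequence


def contains_all(text: str, entities: Sequence[str]) -> bool:
    """Return True if every entity occurs in the text (case-insensitive).

    Assumption: a plain substring match; the paper's actual check may differ.
    """
    lowered = text.lower()
    return all(entity.lower() in lowered for entity in entities)


def generate_report(
    entities: Sequence[str],
    generate_findings: Callable[[Sequence[str]], str],
    summarize: Callable[[str], str],
    max_attempts: int = 5,  # hypothetical cap; the caption only says "repeated"
) -> Optional[Dict[str, str]]:
    """Generate FINDINGS, summarize into IMPRESSION, and retry on failure."""
    for _ in range(max_attempts):
        findings = generate_findings(entities)
        impression = summarize(findings)
        # Both sections must mention every specified entity.
        if contains_all(findings, entities) and contains_all(impression, entities):
            return {"FINDINGS": findings, "IMPRESSION": impression}
    return None  # no valid report within the attempt budget


# Stub generators standing in for the LLM calls (illustrative only).
def _stub_findings(entities: Sequence[str]) -> str:
    return "The chest radiograph shows " + " and ".join(entities) + "."


def _stub_summary(findings: str) -> str:
    return "Impression: " + findings


report = generate_report(
    ["cardiomegaly", "pleural effusion"], _stub_findings, _stub_summary
)
```

Passing the generators in as callables keeps the validation logic independent of any particular LLM client, so the same loop can be exercised with stubs like the ones above.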