RoentGen: Vision-Language Foundation Model for Chest X-ray Generation

Pierre Chambon, Christian Bluethgen, Jean-Benoit Delbrouck, Rogier Van der Sluijs, Małgorzata Połacin, Juan Manuel Zambrano Chaves, Tanishq Mathew Abraham, Shivanshu Purohit, Curtis P. Langlotz, Akshay Chaudhari

TL;DR

RoentGen demonstrates that a large vision-language latent diffusion model can be domain-adapted to generate high-fidelity, controllable chest X-rays (CXRs) conditioned on medical language prompts. By fine-tuning multiple Stable Diffusion (SD) components and optionally replacing the text encoder with domain-specific variants, the model achieves strong fidelity and medical correctness across radiology tasks, including classification augmentation and radiology report generation. The work introduces an evaluation framework spanning image fidelity, diversity, and factual alignment, and shows that synthetic CXRs can improve downstream tasks when used for data augmentation. It also shows that in-domain knowledge can be distilled into the text encoder, but warns of catastrophic forgetting, which motivates methods that preserve prior knowledge while expanding domain capabilities.

Abstract

Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different from natural images, and the language used to succinctly capture relevant details in medical data uses a different, narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multimodal models trained on natural image-text pairs do not tend to generalize well to the medical domain. Developing generative imaging models that faithfully represent medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets. In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and have human domain experts evaluate image quality and text-image alignment. We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled to a new extent by using free-form text prompts including radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement of a classifier trained jointly on synthetic and real images, and a 3% improvement when trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge in the text encoder and can improve its representation capabilities of certain diseases like pneumothorax by 25%.

Paper Structure (26 sections, 3 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Text-to-image synthesis of chest x-ray images using RoentGen, a medical domain-adapted latent diffusion model based on the Stable Diffusion pipeline. A fine-tuned or retrained conditional U-Net denoises a vector of random Gaussian noise, conditioned by embeddings created from short medical free-text prompts by a fine-tuned or replaced text encoder. The decoder of the variational autoencoder of the Stable Diffusion pipeline maps the denoised latent vector to pixel space, resulting in high-fidelity, diverse chest x-rays showing corresponding imaging features.
  • Figure 2: Text-conditioned synthesis of CXR. Each image was hand-picked out of four generated CXR per respective prompt. Here, presence or absence of a finding (pleural effusions, dotted ROI added for visualization) and dimensions like size and laterality were controlled via prompting. Note that the model correctly incorporated the radiological convention of displaying the right patient side on the left side of the image, and vice versa.
  • Figure 3: Intra-prompt image diversity by CLIP token length. Mean MS-SSIM as a measure of intra-prompt generation diversity for 5,000 prompts (with 4 generated images per prompt) for selected models. Lower mean MS-SSIM indicates higher diversity. For visualization, CLIP token lengths have been binned to intervals of size 10. The light areas indicate 95% confidence intervals. SD: Stable Diffusion. PA: Postero-anterior view. AP: Antero-posterior view. Lat: Lateral view.
  • Figure 4: Synthetic images created by prompting a fine-tuned model (60k training steps; learning rate 5e-5; PA-view) for typical CXR abnormalities. The generated CXRs feature high levels of detail: when prompted for 'edema' (top right), perihilar haziness (white arrowheads) and peribronchial cuffing (black arrowhead), both features seen in pulmonary edema, can be observed. For 'pneumothorax' (bottom row, third image from the left), a fine line representing the visceral pleural lining of the partially collapsed lung can be delineated (dashed line).
  • Figure 5: Intra-prompt synthetic image diversity. Four generated samples using the prompts 'Big right-sided pleural effusion with adjacent atelectasis' (top row) and 'Big left-sided pleural effusion with adjacent atelectasis' (bottom row). Note the diverse appearance of the right-sided pleural effusion with atelectasis and varying amounts of interlobar fluid (top row, white arrowheads), and the differences in contrast, with higher contrast especially in the three left images in the bottom row, similar to real CXR when using different X-ray tube voltage settings.
  • ...and 1 more figure
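
The intra-prompt diversity measure described in the Figure 3 caption (a mean similarity score over the images generated for one prompt, where lower similarity means higher diversity) can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the paper: the score is averaged over all unordered image pairs within a prompt, images are flat lists of pixel values in [0, 1], and `global_ssim` is a simplified single-scale, whole-image SSIM used as a stand-in for MS-SSIM. All function names are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def global_ssim(x, y, data_range=1.0):
    """Whole-image, single-scale SSIM on flat pixel lists.

    Simplified stand-in for MS-SSIM (an assumption for this sketch):
    no sliding window and no multi-scale pyramid.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    # Standard SSIM stabilising constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def intra_prompt_similarity(images, similarity=global_ssim):
    """Mean similarity over all unordered pairs of images from one prompt.

    Pairwise averaging is an assumption; lower values mean more diversity.
    """
    pairs = list(combinations(images, 2))
    if not pairs:
        raise ValueError("need at least two images per prompt")
    return fmean([similarity(a, b) for a, b in pairs])


def mean_similarity_over_prompts(images_by_prompt, similarity=global_ssim):
    """Average the per-prompt score across prompts (one value per model)."""
    return fmean(
        [intra_prompt_similarity(imgs, similarity)
         for imgs in images_by_prompt.values()]
    )
```

With four images per prompt, `intra_prompt_similarity` averages over six pairs. The `similarity` argument can be swapped for a real MS-SSIM implementation from an imaging library without changing the aggregation logic.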