MedSyn: Text-guided Anatomy-aware Synthesis of High-Fidelity 3D CT Images

Yanwu Xu, Li Sun, Wei Peng, Shuyue Jia, Katelyn Morrison, Adam Perer, Afrooz Zandifar, Shyam Visweswaran, Motahhare Eslami, Kayhan Batmanghelich

TL;DR

This study develops a method for generating images from textual prompts and anatomical components, and demonstrates the capability to generate new images conditioned on anatomical elements.

Abstract

This paper introduces a method for producing high-quality 3D lung CT images guided by textual information. While diffusion-based generative models are increasingly used in medical imaging, current state-of-the-art approaches are limited to low-resolution outputs and underutilize the abundant information in radiology reports. Radiology reports can enhance the generation process by providing additional guidance and offering fine-grained control over image synthesis. Nevertheless, extending text-guided generation to high-resolution 3D images poses significant challenges in memory use and in preserving anatomical detail. To address the memory issue, we introduce a hierarchical scheme that uses a modified UNet architecture. We start by synthesizing low-resolution images conditioned on the text, which serve as a foundation for subsequent generators that produce the complete volumetric data. To ensure the anatomical plausibility of the generated samples, we provide further guidance by generating vascular, airway, and lobular segmentation masks in conjunction with the CT images. The model can use both textual input and segmentation tasks to generate synthesized images. Comparative assessments indicate that our approach outperforms the most advanced GAN-based and diffusion-based models, especially in accurately retaining crucial anatomical features such as fissure lines, airways, and vascular structures. This study focuses on two main objectives: (1) developing a method for creating images based on textual prompts and anatomical components, and (2) generating new images conditioned on anatomical elements. These advances in image generation can be applied to enhance numerous downstream tasks.
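To put the memory challenge in rough numbers, the following back-of-the-envelope sketch compares the two cubic resolutions named in the Figure 1 caption (64 and 256 voxels per side). The helper `volume_stats` and the choice of 4 bytes per voxel (float32) are assumptions made for illustration; the figures cover raw volume storage only, not the network activations that dominate training memory.

```python
# Illustrative arithmetic only; `volume_stats` is a hypothetical helper,
# and float32 storage (4 bytes per voxel) is an assumption.

def volume_stats(side: int, bytes_per_voxel: int = 4) -> tuple[int, float]:
    """Return (voxel count, raw size in MiB) for a cubic volume."""
    voxels = side ** 3
    mib = voxels * bytes_per_voxel / 2**20
    return voxels, mib


low_voxels, low_mib = volume_stats(64)     # low-resolution stage
high_voxels, high_mib = volume_stats(256)  # full-resolution stage
ratio = high_voxels // low_voxels

print(f"64^3 volume:  {low_voxels} voxels, {low_mib} MiB")
print(f"256^3 volume: {high_voxels} voxels, {high_mib} MiB")
print(f"Full resolution holds {ratio}x as many voxels")
```

Every tensor that scales with the volume grows by the same factor, which is one way to read the motivation for generating at low resolution first and upscaling afterwards.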

Paper Structure (33 sections, 10 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Overview of our generative model, MedSyn. Using a hierarchical approach, we first generate a $64 \times 64 \times 64$ low-resolution volume, along with its anatomical components, conditioning on Gaussian noise $\epsilon$ and radiology report. The low-resolution volumes are then seamlessly upscaled to a detailed $256 \times 256 \times 256$ resolution.
  • Figure 2: Our efficient low-resolution generative model with clinical tokens as input. During training, we update the denoising diffusion UNet and keep the pre-trained Medical BERT text feature extractor fixed. Note that our low-resolution base model has a large capacity of 700 million parameters.
  • Figure 3: Randomly generated images (from HA-GAN and Medical Diffusion) and the real images. The first two columns show axial and coronal slices, displayed with an HU range of [-1024, 600]. The last column shows a zoomed-in region and uses an HU range of [-1024, -250] to highlight lung details. Our method is the only one that preserves delicate anatomical details, including fissures, as indicated by the arrows.
  • Figure 4: Images conditionally generated with disease-related prompts. The first two columns show the real images. We extract disease-related mentions from their associated reports and use them as conditions to generate the images shown in the third and fourth columns. The last two columns show samples synthesized by conditioning on prompts that reverse the disease mention. Four slices are shown for each subject. The generated images are conditioned on text only.
  • Figure 5: Distribution of the cardiothoracic ratio (CTR) for images generated by conditioning on different prompt types. The results show that when the prompt mentions cardiomegaly, the generated images have a higher CTR.
  • ...and 3 more figures
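
The two display ranges named in the Figure 3 caption amount to intensity windowing: Hounsfield unit (HU) values are clipped to a range and rescaled for display. Below is a minimal sketch of that operation, assuming a simple linear rescale to [0, 1]. The function `apply_hu_window`, the constant names, and the sample values are hypothetical and are not taken from the paper.

```python
# Minimal HU windowing sketch. `apply_hu_window` and the sample values are
# hypothetical; only the two HU ranges come from the Figure 3 caption.

def apply_hu_window(hu_values, lo, hi):
    """Clip HU values to [lo, hi], then rescale linearly to [0, 1]."""
    if hi <= lo:
        raise ValueError("window upper bound must exceed lower bound")
    span = hi - lo
    return [(min(max(v, lo), hi) - lo) / span for v in hu_values]


WIDE_WINDOW = (-1024, 600)   # range used for the axial and coronal slices
LUNG_WINDOW = (-1024, -250)  # narrower range used for the zoomed-in region

sample_hu = [-1024, -800, -250, 0, 600]  # illustrative values, not paper data

wide = apply_hu_window(sample_hu, *WIDE_WINDOW)
lung = apply_hu_window(sample_hu, *LUNG_WINDOW)

for hu, w, l in zip(sample_hu, wide, lung):
    print(f"HU {hu:>6}: wide={w:.3f}  lung={l:.3f}")
```

The narrower window saturates everything at or above -250 HU and spends the full display range on low-density tissue, which is why thin structures such as fissures are easier to see in the zoomed-in column.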