GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes

Ibrahim Ethem Hamamci, Sezgin Er, Anjany Sekuboyina, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Furkan Almas, Irem Dogan, Muhammed Furkan Dasdelen, Chinmay Prabhakar, Hadrien Reynaud, Sarthak Pati, Christian Bluethgen, Mehmet Kemal Ozdemir, Bjoern Menze

TL;DR

GenerateCT introduces the first framework for text-conditioned generation of 3D chest CT volumes, combining a CT-ViT autoregressive encoder–decoder, a masked vision–language transformer for radiology text alignment, and a text-conditioned cascaded diffusion model for super-resolution. It demonstrates superior 3D generation quality over baselines, enables scalable synthetic data generation, and achieves practical clinical value through data augmentation and zero-shot generalization to external datasets. The work provides extensive quantitative and qualitative evaluations, including expert assessments of realism and text alignment, and shows substantial improvements in abnormality classification when using synthetic data. By releasing models and data, GenerateCT offers a foundation for privacy-preserving data synthesis in radiology and paves the way for broader adoption of text-to-3D medical imaging.

Abstract

GenerateCT, the first approach to generating 3D medical imaging conditioned on free-form medical text prompts, incorporates a text encoder and three key components: a novel causal vision transformer for encoding 3D CT volumes, a text-image transformer for aligning CT and text tokens, and a text-conditional super-resolution diffusion model. Without directly comparable methods in 3D medical imaging, we benchmarked GenerateCT against cutting-edge methods, demonstrating its superiority across all key metrics. Importantly, we evaluated GenerateCT's clinical applications in a multi-abnormality classification task. First, we established a baseline by training a multi-abnormality classifier on our real dataset. To further assess the model's generalization to external data and performance with unseen prompts in a zero-shot scenario, we employed an external set to train the classifier, setting an additional benchmark. We conducted two experiments in which we doubled the training datasets by synthesizing an equal number of volumes for each set using GenerateCT. The first experiment demonstrated an 11% improvement in the AP score when training the classifier jointly on real and generated volumes. The second experiment showed a 7% improvement when training on both real and generated volumes based on unseen prompts. Moreover, GenerateCT enables the scaling of synthetic training datasets to arbitrary sizes. As an example, we generated 100,000 3D CTs, fivefold the number in our real set, and trained the classifier exclusively on these synthetic CTs. Impressively, this classifier surpassed the performance of the one trained on all available real data by a margin of 8%. Last, domain experts evaluated the generated volumes, confirming a high degree of alignment with the text prompt. Access our code, model weights, training data, and generated data at https://github.com/ibrahimethemhamamci/GenerateCT
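The classification gains above are reported as AP scores. As a rough illustration of how such a score can be computed for a multi-abnormality (multi-label) classifier, here is a minimal sketch in plain Python. The abstract does not say how AP was averaged across abnormalities, so the per-class, macro-averaged definition and the function names `average_precision` and `mean_average_precision` are assumptions for illustration only.

```python
def average_precision(labels, scores):
    """AP for one binary label: mean of the precision measured at each
    positive sample, after ranking samples by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    # Convention assumed here: a class with no positives contributes 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(label_rows, score_rows):
    """Macro average of per-class AP (assumed averaging scheme).
    Each row is one volume; each column is one abnormality."""
    n_classes = len(label_rows[0])
    per_class = []
    for c in range(n_classes):
        labels = [row[c] for row in label_rows]
        scores = [row[c] for row in score_rows]
        per_class.append(average_precision(labels, scores))
    return sum(per_class) / n_classes


# Hypothetical toy data: 4 volumes, 2 abnormalities.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.9], [0.1, 0.3]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

Comparing this value for a classifier trained on real volumes only against one trained on real plus generated volumes gives the kind of difference the abstract reports. Whether the quoted percentages are absolute or relative differences is not stated here.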

Paper Structure (16 sections, 3 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: GenerateCT is a cascaded framework that generates high-resolution and high-fidelity 3D chest CT volumes based on medical language text prompts.
  • Figure 1: Example 2D slices of generated 3D CT volumes with varied windowing settings. Each example includes three windowing settings for the same slice: (1) within the raw HU range of $[-1000~\text{HU}, +1000~\text{HU}]$, (2) lung window within the range of $[-1000~\text{HU}, +150~\text{HU}]$, and (3) mediastinal window within the range of $[-125~\text{HU}, +225~\text{HU}]$. This highlights GenerateCT's ability to produce highly detailed and clinically accurate 3D chest CT volumes based on text descriptions.
  • Figure 2: The GenerateCT architecture consists of three main components. (1) The CT-ViT encoder processes the embeddings of CT patches from raw slices S through a spatial transformer followed by a causal transformer (autoregressive along the depth axis), generating CT tokens. (2) The vision-language transformer is trained to reconstruct masked tokens based on the frozen CT-ViT encoder's predictions, conditioned on T5X text prompt tokens. (3) A text-conditional diffusion model upsamples low-resolution slices from the generated 3D chest CT volumes. Together, these components allow GenerateCT to generate high-resolution 3D chest CT volumes with arbitrary slice numbers, conditioned on medical language text prompts.
  • Figure 2: Cross-attention maps illustrate specific abnormalities in the text-conditional generation of 3D chest CT volumes with varied windowing settings, underscoring GenerateCT's precision in translating medical terminology into clinically relevant image features in the corresponding areas. Although our work generates comprehensive 3D chest CT volumes, we present only 2D axial slices due to presentation and visualization constraints. These slices act as representative examples to demonstrate the depth and detail GenerateCT can achieve, providing insights into its ability to accurately depict complex anatomical structures and abnormalities in a three-dimensional context.
  • Figure 4: Three sequential slices from each synthetic 3D chest CT within the practical HU range of $[-1000~\text{HU}, +1000~\text{HU}]$ generated based on the given prompt, showcasing GenerateCT's proficiency in preserving spatial consistency across successive slices. Abnormalities referenced in the prompts are color-highlighted, underscoring our method's precision in translating textual descriptions into clinically accurate volumetric features.
  • ...and 2 more figures
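
The windowing settings named in the captions above can be illustrated with a short sketch. Windowing clips Hounsfield unit (HU) values to a chosen range and rescales them linearly for display. The three ranges are taken from the caption; the function name `apply_window`, the rescaling to [0, 1], and the sample HU values are assumptions for illustration and are not taken from the GenerateCT code.

```python
# Window ranges in Hounsfield units, as listed in the figure caption.
WINDOWS = {
    "raw": (-1000.0, 1000.0),
    "lung": (-1000.0, 150.0),
    "mediastinal": (-125.0, 225.0),
}


def apply_window(hu_values, low, high):
    """Clip each HU value to [low, high], then rescale linearly to [0, 1]."""
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in hu_values]


# Hypothetical HU samples, roughly: air, lung tissue, water, soft tissue, bone.
sample_hu = [-1000.0, -700.0, 0.0, 50.0, 700.0]

# One display-ready version of the same samples per window.
views = {
    name: apply_window(sample_hu, low, high)
    for name, (low, high) in WINDOWS.items()
}
```

A narrow window such as the mediastinal one spends the whole display range on soft-tissue values, so air and bone saturate at 0 and 1. This is why the same slice looks different under each setting.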