Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi

TL;DR

This work tackles the challenge of generating high-resolution 3D CT volumes directly from textual radiology descriptions. It introduces a unified Text-to-CT pipeline that combines a 3D CLIP-based vision-language encoder, latent-space diffusion, and volumetric VAE compression to synthesize CT scans without external super-resolution steps. Empirical results on CT-RATE show strong image fidelity, high factual correctness, and robust semantic alignment between text and anatomy, with notable gains in downstream diagnostic utility when synthetic data augment real data. The approach offers scalable, controllable CT synthesis with potential applications in data augmentation, medical education, and clinical simulation, while highlighting the importance of modality-specific vision-language grounding for 3D medical image generation.

Abstract

Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric CT remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging.

Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages.

Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance.

Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation. Code at https://github.com/cosbidev/Text2CT.
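
As an illustration of the CLIP-style contrastive pretraining mentioned in the Methods, the following is a minimal NumPy sketch of a symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. It is not the authors' implementation: the function name `clip_contrastive_loss`, the embedding shapes, and the temperature of 0.07 are assumptions made for this example, and the real encoders (a 3D vision encoder and a text encoder) are replaced by precomputed embedding arrays.

```python
import numpy as np


def clip_contrastive_loss(volume_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired (CT volume, report) embeddings.

    volume_emb, text_emb: arrays of shape (batch, dim); row i of each is a pair.
    temperature: assumed value for illustration, not taken from the paper.
    """
    # L2-normalise so the dot product is a cosine similarity.
    v = volume_emb / np.linalg.norm(volume_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; entry (i, j) compares volume i with report j.
    logits = v @ t.T / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Average the volume-to-text and text-to-volume directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))
```

Minimising this loss pulls each volume embedding towards its own report and away from the other reports in the batch, which is the sense in which a shared embedding space is established.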

Paper Structure

This paper contains 32 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of the proposed Text-to-CT generation framework: (a) 3D CLIP Training: A contrastive learning setup aligns volumetric CT scans and radiology reports into a shared embedding space using a 3D Vision Transformer and a text encoder. (b) Diffusion UNet Training: A latent diffusion model is trained to denoise compressed CT representations, conditioned on textual embeddings. A pretrained VAE encoder compresses the volumes into latent vectors, which are noised and passed through a 3D U-Net, conditioned with report embeddings via cross-attention. (c) Inference: A synthetic latent code is generated from noise using the textual prompt, then decoded into a high-resolution CT volume via the VAE decoder.
  • Figure 2: Qualitative comparison of real and synthetic CT volumes generated by competing methods using the same textual prompt. The top three rows show axial, sagittal, and coronal slices, respectively, while the bottom row displays corresponding 3D volume renderings. Red boxes highlight regions affected by visual artifacts introduced by super-resolution stages. GenerateCT [hamamci2024generatect] suffers from severe inter-slice discontinuities due to its 2D upsampling, especially visible in sagittal and coronal views. MedSyn [xu2024medsyn] exhibits grid-like distortions from its 3D refinement stage. In contrast, our model produces anatomically coherent volumes across all planes and exhibits structural fidelity in the 3D rendering.
  • Figure 3: Effect of guidance scale on generation fidelity (FID 3D) and clinical relevance (Precision). While FID remains low and relatively stable across scales, precision peaks at a moderate guidance value, highlighting the importance of semantic controllability beyond visual similarity.
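
The noising step in Figure 1(b) and the guidance scale in Figure 3 can be sketched as follows. This is a hedged illustration only. It assumes a standard DDPM-style forward process with a linear beta schedule, and it assumes that the guidance scale refers to classifier-free guidance; neither detail is stated in the captions above. The names `make_alpha_bar`, `add_noise`, and `guided_noise`, the schedule endpoints, and the number of steps are hypothetical choices for this example.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, eps, t, alpha_bar):
    """Forward diffusion: mix a clean latent z0 with Gaussian noise eps at step t."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance (assumed): extrapolate from the unconditional
    noise prediction towards the text-conditioned one.

    A scale of 0 returns the unconditional prediction, 1 returns the
    conditional prediction, and larger values push further towards the text.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Under these assumptions, the denoising network would be trained to recover `eps` from `add_noise(z0, eps, t, alpha_bar)` and the report embedding, and `guided_noise` would be applied at each sampling step. A larger scale strengthens the influence of the text prompt, which is one plausible reading of why precision in Figure 3 depends on the scale while FID stays relatively stable.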