GuideGen: A Text-Guided Framework for Paired Full-torso Anatomy and CT Volume Generation
Linrui Dai, Rongzhao Zhang, Yongrui Yu, Xiaofan Zhang
TL;DR
GuideGen addresses the scarcity of annotated 3D full-torso CT data by combining a text-driven semantic synthesizer, an anatomy-aware HDR autoencoder, and a latent diffusion model to generate paired full-torso anatomies and CT volumes from textual prompts. The framework introduces a discrete, knowledge-augmented text-conditional semantic synthesizer (TCSS) to reduce label ambiguity, preserves the full intensity range of CT with an HDR autoencoder guided by semantic masks, and couples both with a text-conditioned latent diffusion generator for high-quality CT synthesis. It is trained on a multi-dataset collection from TCIA plus an in-house RJ dataset, and it outperforms existing methods in generation quality, cross-modality alignment, and usability for downstream segmentation. This work enables scalable, text-driven dataset synthesis for medical image segmentation, potentially accelerating clinical research and treatment planning while reducing reliance on labor-intensive manual labeling.
Abstract
Recently emerging conditional diffusion models seem promising for mitigating the labor and expense of building large 3D medical imaging datasets. However, previous studies on 3D CT generation primarily focus on specific organs characterized by a local structure and fixed contrast, and have yet to fully capitalize on the benefits of both semantic and textual conditions. In this paper, we present GuideGen, a controllable framework based on easily acquired text prompts to generate anatomical masks and corresponding CT volumes for the entire torso, from chest to pelvis. Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; an anatomy-aware high-dynamic-range (HDR) autoencoder for high-fidelity feature extraction across varying intensity levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics, and input prompts. Combined, these components enable data synthesis for segmentation tasks from only textual instructions. To train and evaluate GuideGen, we compile a multi-modality cancer imaging dataset with paired CT and clinical descriptions from 12 public TCIA datasets and one private real-world dataset. Comprehensive evaluations across generation quality, cross-modality alignment, and data usability on multi-organ and tumor segmentation tasks demonstrate GuideGen's superiority over existing CT generation methods. Relevant materials are available at https://github.com/OvO1111/GuideGen.
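The motivation for handling "varying intensity levels" can be illustrated with a small sketch of conventional CT windowing. This is not GuideGen's autoencoder, whose internals the abstract does not describe. It only shows the underlying problem: a single fixed display window clips part of the Hounsfield-unit (HU) range that a full-torso volume spans. The function name `apply_window`, the sample voxel values, and the window settings (commonly used soft-tissue and lung presets) are assumptions chosen for illustration.

```python
# Illustrative only: the window presets below are common radiology
# conventions, not settings taken from GuideGen.

def apply_window(hu_values, center, width):
    """Map Hounsfield units to [0, 1] using a display window.

    Values below (center - width/2) saturate at 0 and values above
    (center + width/2) saturate at 1, so contrast outside the window is lost.
    """
    low = center - width / 2.0
    high = center + width / 2.0
    scaled = []
    for hu in hu_values:
        clipped = min(max(hu, low), high)
        scaled.append((clipped - low) / (high - low))
    return scaled

# Hypothetical sample voxels: air, lung tissue, fat, soft tissue, dense bone.
sample_hu = [-1000.0, -700.0, -100.0, 40.0, 1000.0]

# Soft-tissue window (assumed preset: center 40 HU, width 400 HU).
soft_tissue = apply_window(sample_hu, center=40.0, width=400.0)

# Lung window (assumed preset: center -600 HU, width 1500 HU).
lung = apply_window(sample_hu, center=-600.0, width=1500.0)

print("soft tissue:", soft_tissue)
print("lung:       ", lung)
```

Under the soft-tissue window, air and lung tissue both collapse to 0, so they become indistinguishable, while the lung window separates them but compresses soft tissue toward the top of the range and saturates bone. A model covering chest to pelvis has to represent all of these tissues at once, which is the kind of difficulty an HDR-oriented encoder is meant to address.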
