GuideGen: A Text-Guided Framework for Paired Full-torso Anatomy and CT Volume Generation

Linrui Dai, Rongzhao Zhang, Yongrui Yu, Xiaofan Zhang

TL;DR

GuideGen addresses the scarcity of annotated 3D full-torso CT data by combining a text-conditional semantic synthesizer (TCSS), an anatomy-aware HDR autoencoder, and a latent-guided diffusion model to generate paired full-torso anatomical masks and CT volumes from text prompts. The TCSS is discrete and knowledge-augmented to reduce label ambiguity, the HDR-aware autoencoder is guided by semantic masks to preserve the full intensity range, and a text-conditioned diffusion-based latent generator fuses the two to synthesize the CT volume. The framework is trained on a multi-dataset collection from TCIA and an in-house RJ dataset, and the authors report better generation quality, cross-modality alignment, and downstream segmentation usability than existing methods. The work points toward scalable, text-driven dataset synthesis for medical image segmentation, which could reduce reliance on labor-intensive manual labeling and support clinical research and treatment planning.

Abstract

Recently emerging conditional diffusion models seem promising for mitigating the labor and expense of building large 3D medical imaging datasets. However, previous studies on 3D CT generation primarily focus on specific organs characterized by a local structure and fixed contrast, and have yet to fully capitalize on the benefits of both semantic and textual conditions. In this paper, we present GuideGen, a controllable framework that uses easily acquired text prompts to generate anatomical masks and corresponding CT volumes for the entire torso, from chest to pelvis. Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; an anatomy-aware high-dynamic-range (HDR) autoencoder for high-fidelity feature extraction across varying intensity levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics, and input prompts. Combined, these components enable data synthesis for segmentation tasks from textual instructions alone. To train and evaluate GuideGen, we compile a multi-modality cancer imaging dataset with paired CT and clinical descriptions from 12 public TCIA datasets and one private real-world dataset. Comprehensive evaluations of generation quality, cross-modality alignment, and data usability on multi-organ and tumor segmentation tasks demonstrate GuideGen's superiority over existing CT generation methods. Relevant materials are available at https://github.com/OvO1111/GuideGen.

Paper Structure (19 sections, 11 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Overview of GuideGen's training pipeline. (a) First, GuideGen learns to generate discrete semantic volumes that conform to the spatial features specified in the medical prompt (see Sec. 3.1). (b) Second, GuideGen uses a pyramidal autoencoding scheme to incorporate mask knowledge and reconstruct fine CT details with a high dynamic range (see Sec. 3.2). (c) Finally, GuideGen combines the semantic latents derived from (a), the image latents extracted in (b), and the textual latents from the medical prompt to synthesize full-torso CT images (see Sec. 3.3). (d) Internal structure of the knowledge injection module used in (a) and (c), which extracts task-specific features from a structured input.
  • Figure 2: Qualitative results of different generation methods conditioned on the same textual prompts. Mask inputs to the baseline models are generated by GuideGen, with tumor semantics shown in red (first column). Two slices are shown per case. For better visualization, we use a CT intensity window of [-975, -225] HU for the chest region (rows 1-2) and [-50, 150] HU for the abdominal region (rows 3-6) (Seeram, 2015). See our project page for more qualitative results.
  • Figure 3: (a) Qualitative results of generated full-torso anatomical masks, with tumor masks shown in red. (b) Quantitative results evaluating GuideGen's mask-prompt alignment along two dimensions: number of tumors and tumor location.
  • Figure 4: Segmentation performance using different numbers (0, 100, 200, 500, 1K) of GuideGen-generated samples as augmentation. Darker and lighter solid lines denote segmentation models trained with and without real data, respectively.
  • Figure 5: Demographic distributions of the TCIA and RJ datasets used for generative training and for quality and alignment evaluations.
  • ...and 3 more figures
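
The display windows quoted in the Figure 2 caption can be illustrated with a short sketch. It assumes the usual linear windowing convention (clip Hounsfield units to the window, then rescale to [0, 1]); the function name `apply_window` and the sample HU values are hypothetical and do not come from the paper or its code.

```python
import numpy as np

def apply_window(hu, lo, hi):
    """Clip HU values to [lo, hi] and rescale linearly to [0, 1] for display."""
    hu = np.asarray(hu, dtype=np.float32)
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

# Window bounds taken from the Figure 2 caption (in HU).
CHEST_WINDOW = (-975.0, -225.0)
ABDOMEN_WINDOW = (-50.0, 150.0)

# Hypothetical HU samples, chosen only to exercise both windows.
sample = np.array([-1000.0, -600.0, 0.0, 50.0, 400.0])

chest = apply_window(sample, *CHEST_WINDOW)
abdomen = apply_window(sample, *ABDOMEN_WINDOW)
print(chest)
print(abdomen)
```

The same volume looks very different under the two windows: values that saturate to white in the chest window still span the visible range in the abdominal one, which is why a single fixed window cannot display the whole torso well.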