Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation

Delin An, Chaoli Wang

Abstract

Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations. Built upon a capsule-attention backbone, these modules leverage the complementary strengths of sketches and text to produce anatomically accurate organ shapes. The synthesized segmentation masks subsequently guide a latent diffusion model for 3D CT volume synthesis, enabling realistic reconstruction of organ appearances that are consistent with user-defined sketches and descriptions. Extensive experiments on public CT datasets demonstrate that Sketch2CT achieves superior performance in generating multimodal medical volumes. Its controllable, low-cost generation pipeline enables principled, efficient augmentation of medical datasets. Code is available at https://github.com/adlsn/Sketch2CT.

Paper Structure (17 sections, 10 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Multimodal sketch and text guided 3D medical image generation results from our Sketch2CT method. For each organ, a user-provided sketch and text description serve as structural and semantic conditions. Sketch2CT produces both a 3D segmentation mask and a synthesized medical volume that closely follows the geometry and anatomy implied by the input. S1-S7 denote seven consecutive axial slices from the synthesized 3D volume, illustrating spatial continuity.
  • Figure 2: Overview of our Sketch2CT framework. (a) A capsule-based sketch encoder and a sentence transformer extract structural and semantic features, which are fused via a FiLM module. (b) The fused representation conditions a segmentation latent diffusion model to generate 3D organ masks. (c) The predicted segmentation latent guides an image latent diffusion model to synthesize 3D medical volumes.
  • Figure 3: Qualitative comparison of baseline methods. For each case, we extract a sketch and text description from the ground-truth mask to encode organ geometry. These features guide the generation of 3D masks and the synthesis of 3D volumes.
  • Figure A1: Examples of textual geometry descriptions used as the text-based conditioning input in Sketch2CT. For each organ, we show the input sketch, the generated mask, and the corresponding structured text description that captures the global shape, surface characteristics, symmetry, and topology.
  • Figure A2: Effect of sketch granularity on segmentation mask generation. For each dataset, we vary the sketch detail in three levels, coarse, medium, and fine, and visualize the corresponding 3D masks produced by Sketch2CT. Increased sketch detail introduces more local structural variations, while the overall anatomical geometry remains consistent across granularity levels.
  • ...and 2 more figures
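
The Figure 2 caption states that sketch and text features are fused via a FiLM module. As a rough illustration of what feature-wise linear modulation generally does (not the authors' implementation), the sketch below scales and shifts each channel of a sketch feature map using parameters predicted from a text embedding. The function name `film_fuse`, the single linear projections, and all shapes are assumptions made for this example; the paper's actual layer sizes and encoders are not given in this section.

```python
import numpy as np

def film_fuse(sketch_feat, text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Hypothetical FiLM-style fusion: modulate sketch features with text.

    sketch_feat: array of shape (C, ...) with C channels first.
    text_emb:    array of shape (E,), e.g. a sentence embedding.
    w_*, b_*:    assumed linear projections mapping E -> C.
    """
    # Predict one scale (gamma) and one shift (beta) per channel from the text.
    gamma = text_emb @ w_gamma + b_gamma  # shape (C,)
    beta = text_emb @ w_beta + b_beta     # shape (C,)

    # Reshape to (C, 1, 1, ...) so the parameters broadcast over spatial axes.
    expand = (slice(None),) + (None,) * (sketch_feat.ndim - 1)
    return gamma[expand] * sketch_feat + beta[expand]

# Toy example with made-up sizes: 4 channels, 6x6 feature map, 8-dim text embedding.
rng = np.random.default_rng(0)
C, H, W, E = 4, 6, 6, 8
sketch_feat = rng.standard_normal((C, H, W))
text_emb = rng.standard_normal(E)
w_gamma = rng.standard_normal((E, C)) * 0.1
b_gamma = np.ones(C)    # bias of 1 so an all-zero text embedding leaves features unscaled
w_beta = rng.standard_normal((E, C)) * 0.1
b_beta = np.zeros(C)

fused = film_fuse(sketch_feat, text_emb, w_gamma, b_gamma, w_beta, b_beta)
print(fused.shape)
```

In this formulation the text acts only through per-channel scale and shift, so the spatial layout comes entirely from the sketch features. That is one plausible reading of why the caption pairs a structural (sketch) encoder with a semantic (text) encoder.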