VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation

Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein

TL;DR

VoxTell addresses free-text prompted 3D medical image segmentation by mapping prompts to masks for volumes $V \in \mathbb{R}^{H \times W \times D}$. It introduces multi‑stage vision–language fusion that injects text guidance at multiple decoder scales, enabling accurate, instance-aware segmentation across CT, MRI, and PET. Trained on over 62K volumes and 1,087 concepts with a large vocabulary of 9,682 rewritten labels, VoxTell achieves state-of-the-art zero-shot performance and robust cross-modality generalization, including prompts derived from real radiology reports. This work demonstrates clinically meaningful language-driven segmentation and points toward open‑set generalization through few-shot extensions and richer text supervision.

Abstract

We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
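
To make the idea of multi-stage vision-language fusion more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that fusion at each decoder scale can be approximated by a MaskFormer-style dot product between a projected prompt embedding and every voxel feature. The function name `fuse_text_with_features`, the per-scale `projections` matrices, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def fuse_text_with_features(text_embedding, feature_maps, projections):
    """Return one text-conditioned response map per decoder scale.

    text_embedding: (E,) prompt embedding.
    feature_maps:   list of (C_s, H_s, W_s, D_s) arrays, one per scale.
    projections:    list of (C_s, E) matrices mapping the prompt embedding
                    into each scale's channel space (hypothetical).
    """
    responses = []
    for feats, proj in zip(feature_maps, projections):
        # Project the prompt embedding to this scale's channel dimension.
        query = proj @ text_embedding  # shape (C_s,)
        # Dot product between the query and the feature vector of each voxel.
        response = np.einsum("c,chwd->hwd", query, feats)
        responses.append(response)
    return responses


# Toy inputs: random features at two resolutions (assumed shapes).
rng = np.random.default_rng(0)
text_embedding = rng.standard_normal(16)
feature_maps = [
    rng.standard_normal((8, 4, 4, 4)),
    rng.standard_normal((4, 8, 8, 8)),
]
projections = [
    rng.standard_normal((8, 16)),
    rng.standard_normal((4, 16)),
]

responses = fuse_text_with_features(text_embedding, feature_maps, projections)
print([r.shape for r in responses])
```

Each response map has the spatial shape of its scale, so a decoder could use it to modulate the volumetric features at that resolution. How VoxTell actually combines these signals is described in the paper and is not reproduced here.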

Paper Structure

This paper contains 41 sections, 7 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: VoxTell performs 3D medical image segmentation directly from arbitrary free-text prompts. The figure shows progressively challenging scenarios: (a) known anatomical structures seen during training, (b) generalization of learned concepts to other imaging modalities, (c) novel concepts never encountered during training, and (d) clinical language understanding from real radiology reports with spatially grounded descriptions. The bar chart (right) reports Dice scores on a held-out radiotherapy cohort using report-derived prompts, shown in (d), where VoxTell outperforms prior text-promptable segmentation methods.
  • Figure 2: Overview of VoxTell. Left: A 3D image volume is encoded into latents, while a free-text prompt is first embedded and then processed by a prompt decoder to produce multi-scale text features that guide image decoding. Right: The decoder performs multi-stage vision–language fusion: at each resolution, text embeddings modulate volumetric features, extending MaskFormer-style query–image fusion to multiple scales with deep supervision.
  • Figure 3: Prompt Stability. Dice score distributions of all methods across multiple textual prompts for the same anatomical structure. Competing methods exhibit high variability, often failing on certain synonyms or misspellings, while VoxTell maintains consistently high performance, even on prompts not seen during training.
  • Figure 4: Free-Text Segmentation on ReXGroundingCT. Evaluation on the ReXGroundingCT benchmark [baharoon2025state] (validation set), which links radiology report findings to 3D segmentations in CT-RATE [ct-rate] chest CTs, assessing instance-level localization and segmentation from text. Following the benchmark protocol, both SAT (the current SoTA) and VoxTell were fine-tuned on the training set. VoxTell outperforms SAT in Dice and in hit rate HIT$_{5\%}$ (the fraction of instances with Dice $\ge$ 5%).
  • Figure 5: Qualitative comparison of text-prompted segmentation across varying prompt complexity. (a) Known anatomical concepts, (b) unseen pathological structures, and (c) sentence-level clinical descriptions from in-house radiology reports. VoxTell produces accurate segmentations across all prompt types, while competing methods struggle on in-distribution prompts and fail on unseen or complex queries.
  • ...and 5 more figures
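
The two metrics named in the Figure 4 caption can be sketched as follows. This is a minimal illustration that assumes binary 3D masks and a list of per-instance Dice scores. The helper names `dice_score` and `hit_rate`, and the convention that two empty masks score 1.0, are assumptions made here and are not taken from the benchmark code.

```python
import numpy as np


def dice_score(pred, target):
    """Dice overlap between two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def hit_rate(dice_scores, threshold=0.05):
    """Fraction of instances whose Dice is at least `threshold`."""
    scores = np.asarray(dice_scores, dtype=float)
    return float((scores >= threshold).mean())


# Toy example: a 2x2x2 target cube and a prediction shifted by one voxel.
target = np.zeros((4, 4, 4), dtype=bool)
target[0:2, 0:2, 0:2] = True
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3, 0:2, 0:2] = True

shifted_dice = dice_score(pred, target)                # 4 shared voxels, 8 + 8 total
missed_dice = dice_score(np.zeros_like(pred), target)  # empty prediction
print(shifted_dice, missed_dice, hit_rate([shifted_dice, missed_dice]))
```

With the 5% threshold, the shifted prediction counts as a hit and the empty prediction does not, so HIT$_{5\%}$ measures whether a finding was localized at all, while Dice measures how well it was delineated.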