VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein
TL;DR
VoxTell addresses free-text-prompted 3D medical image segmentation by mapping text prompts to 3D masks for volumes $V \in \mathbb{R}^{H \times W \times D}$. It introduces multi-stage vision-language fusion that injects text guidance at multiple decoder scales, enabling accurate, instance-aware segmentation across CT, MRI, and PET. Trained on over 62K volumes covering 1,087 concepts, with a vocabulary of 9,682 rewritten labels, VoxTell achieves state-of-the-art zero-shot performance and robust cross-modality generalization, including on prompts derived from real radiology reports. This work demonstrates clinically meaningful language-driven segmentation and points toward open-set generalization through few-shot extensions and richer text supervision.
Abstract
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
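To make the idea of fusing text and image features at multiple decoder scales concrete, the following is a minimal NumPy sketch and not the VoxTell implementation. Everything in it is an illustrative assumption: the function names `fuse_text_at_scales` and `upsample_nearest`, the random stand-ins for the text embedding and decoder features, the per-scale linear projections, and the choice to combine scales by summing upsampled response maps. The actual fusion mechanism is defined in the paper and the released code.

```python
import numpy as np

def fuse_text_at_scales(feature_maps, text_embedding, projections):
    """Return one text-conditioned response map per decoder scale (illustrative)."""
    responses = []
    for feats, proj in zip(feature_maps, projections):
        # feats: (C, H, W, D); proj: (C, E); text_embedding: (E,)
        query = proj @ text_embedding  # project the text into this scale's channel space
        # Dot product between the text query and the feature vector at every voxel.
        response = np.einsum("c,chwd->hwd", query, feats)
        responses.append(response)
    return responses

def upsample_nearest(volume, factor):
    """Nearest-neighbour upsampling of a 3D array by an integer factor."""
    for axis in range(3):
        volume = np.repeat(volume, factor, axis=axis)
    return volume

rng = np.random.default_rng(0)
embed_dim = 16
text_embedding = rng.standard_normal(embed_dim)  # stand-in for a text encoder output

# Three hypothetical decoder scales, coarse to fine: (channels, H, W, D).
shapes = [(32, 4, 4, 4), (16, 8, 8, 8), (8, 16, 16, 16)]
feature_maps = [rng.standard_normal(s) for s in shapes]
projections = [rng.standard_normal((s[0], embed_dim)) for s in shapes]

responses = fuse_text_at_scales(feature_maps, text_embedding, projections)

# Bring every scale to full resolution, sum the responses, and threshold.
full_res = shapes[-1][1]
logits = sum(upsample_nearest(r, full_res // r.shape[0]) for r in responses)
mask = logits > 0
print(mask.shape)
```

The point of the sketch is only that the same prompt embedding conditions every scale, so coarse scales can contribute context while fine scales contribute boundary detail; a trained model would learn the projections and the way scales are combined.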
