Large-Vocabulary Segmentation for Medical Images with Text Prompts

Ziheng Zhao; Yao Zhang; Chaoyi Wu; Xiaoman Zhang; Xiao Zhou; Ya Zhang; Yanfeng Wang; Weidi Xie

TL;DR

This work introduces Segment Anything with Text (SAT), a large-vocabulary, text-prompted segmentation framework for 3D medical images. It combines a multimodal medical knowledge tree, a knowledge-enhanced text encoder, and a 3D segmentation backbone to automatically segment 497 targets across eight body regions. SAT-Pro (447M parameters) achieves performance comparable to 72 specialist nnU-Nets (over 2.2B parameters combined) with a fraction of their parameter count, generalizes well to cross-center datasets, and outperforms both interactive segmentation baselines such as MedSAM and 2D text-prompted methods. The approach demonstrates the potential to ground language models in clinical segmentation tasks and points to future directions in open-vocabulary medical segmentation and language-grounded radiology pipelines.

Abstract

This paper aims to build a model that can Segment Anything in 3D medical images, driven by medical terminologies as Text prompts, termed SAT. Our main contributions are three-fold: (i) We construct the first multimodal knowledge tree on human anatomy, including 6502 anatomical terminologies. We then build the largest and most comprehensive segmentation dataset for training, collecting over 22K 3D scans from 72 datasets across 497 classes, with careful standardization of both image and label space; (ii) We propose to inject medical knowledge into a text encoder via contrastive learning and formulate a large-vocabulary segmentation model that can be prompted by medical terminologies in text form; (iii) We train SAT-Nano (110M parameters) and SAT-Pro (447M parameters). SAT-Pro achieves performance comparable to 72 nnU-Nets, the strongest specialist models trained on each dataset (over 2.2B parameters combined), over 497 categories. Compared with the interactive approach MedSAM, SAT-Pro consistently performs better across all 7 human body regions, with a +7.1% average Dice Similarity Coefficient (DSC) improvement, while showing enhanced scalability and robustness. On 2 external (cross-center) datasets, SAT-Pro achieves higher performance than all baselines (+3.7% average DSC), demonstrating superior generalization ability.
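The abstract reports its results as Dice Similarity Coefficient (DSC). As a minimal sketch of how this metric is commonly computed for a single binary mask, DSC = 2|P ∩ T| / (|P| + |T|), the following may help. The function name `dice_similarity`, the flattened-list mask representation, and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not details taken from the paper.

```python
def dice_similarity(pred, target):
    """Dice Similarity Coefficient between two flattened binary masks.

    pred, target: equal-length sequences of 0/1 (or False/True) voxel labels.
    Returns 2 * |pred AND target| / (|pred| + |target|).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")

    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels in each mask, summed.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_similarity(prediction, reference))
```

In a text-prompted setting, one such score would be computed per prompted class and then averaged; the averaging scheme used in the paper is not described in this section.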

Paper Structure (39 sections, 14 equations, 19 figures, 38 tables)

Introduction
Results
Comparison with Specialist Models on Automatic Segmentation
Comparison with Interactive Segmentation Foundation Model
Compare with Text-Prompted Segmentation Foundation Model
Evaluation on External Datasets
Ablation Study on Text Encoder
Qualitative Results -- SAT as an Interface Between Language and Segmentation
Discussion
Method
Domain Knowledge
Segmentation Dataset
Large-Vocabulary Segmentation Prompted by Text
Multimodal Knowledge Injection
Segmentation Training
...and 24 more sections

Figures (19)

Figure 1: Segment Anything in 3D medical images with Text. In contrast to conventional specialist models (a) that develop a specialized solution for each task, or recently proposed interactive segmentation foundation models (b) that rely on real-time human intervention, Segment Anything by Text (SAT) directly takes 3D volumes as input and uses text as prompts to perform a wide array of medical image segmentation tasks across different modalities, anatomies, and body regions (c). It can be easily applied in clinics or seamlessly integrated with any agent-based large language model.
Figure 1: Architecture details of SAT. (a) Vision encoder and decoder follow a 6-layer U-Net architecture; (b) Text encoder is a 12-layer BERT model; (c) Query decoder consists of 6 transformer decoder layers.
Figure 2: Overview of SAT-DS, comprising diverse segmentation tasks spanning multiple imaging modalities and anatomical regions, including the brain, head and neck, thorax, spine, abdomen, upper limbs, lower limbs, and pelvis. This comprehensive dataset enables the training of a large-vocabulary segmentation foundation model.
Figure 2: Workflow details of SAT. SAT takes 3D radiology images as input and can be prompted by an arbitrary number of terminologies in text form. A binary segmentation prediction is generated for each prompt. Key variables and their dimensional information are annotated on the figure.
Figure 3: Internal evaluation between SAT-Pro, SAT-Nano, and three specialist models on 72 datasets from SAT-DS. Results are merged by human body region and lesion. a, Box plots of DSC and NSD results. The center line within each box indicates the median value; the bottom and top bounds indicate the 25th and 75th percentiles, respectively. The mean value is marked with a plus sign. The whiskers extend to 1.5 times the interquartile range. Outlier classes are plotted as individual dots. b, Performance comparison between SAT-Pro and the most competitive specialist models, nnU-Nets. c, Comparison between SAT and specialist models on model size and capability range. SAT has a much smaller model size than the ensemble of specialist models while being capable of segmenting 497 targets in one model. By comparison, each specialist model can segment only 12 targets on average.
...and 14 more figures
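
The box-plot conventions described in the Figure 3 caption (median, 25th and 75th percentiles, mean, whiskers at 1.5 times the interquartile range, outliers as individual points) can be sketched as follows. This is only an illustration under assumptions: the function name `box_plot_stats` is hypothetical, the quartiles use the "inclusive" method of Python's `statistics.quantiles`, and the whiskers are drawn to the most extreme data points inside the 1.5 IQR fences, which is a common plotting convention that the caption does not confirm.

```python
import statistics


def box_plot_stats(scores):
    """Summarize per-class scores the way a standard box plot would."""
    data = sorted(scores)
    # Quartiles; the interpolation method is an assumption (see lead-in).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr

    # Points inside the fences define the whisker ends; the rest are outliers.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]

    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "mean": statistics.fmean(data),
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Made-up numbers purely for demonstration, not results from the paper.
    print(box_plot_stats([1, 2, 3, 4, 100]))
```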
