SegVol: Universal and Interactive Volumetric Medical Image Segmentation

Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao

TL;DR

SegVol is a universal, interactive foundation model for 3D volumetric medical image segmentation that accepts semantic (text) and spatial (point, bounding box) prompts across 200+ anatomical categories. It combines a 3D Vision Transformer pre-trained with SimMIM on 90K unlabeled CT volumes with a frozen CLIP-based text encoder to achieve cross-dataset generalization, aided by a zoom-out-zoom-in inference workflow and a pseudo-label strategy. Across 22 anatomical segmentation tasks, SegVol outperforms competing methods on 19, improving on the runner-up by up to 37.24%, with substantial gains on challenging lesions and organs; its performance also improves as training data scales up and when semantic and spatial prompts are combined. The work targets clinical workflows that need precise, interactive 3D segmentation and points to multi-modality and referring-segmentation extensions as future research.

Abstract

Precise image segmentation provides clinical study with instructive information. Despite the remarkable progress achieved in medical image segmentation, there is still an absence of a 3D foundation segmentation model that can segment a wide range of anatomical categories with easy user interaction. In this paper, we propose a 3D foundation segmentation model, named SegVol, supporting universal and interactive volumetric medical image segmentation. By scaling up training data to 90K unlabeled Computed Tomography (CT) volumes and 6K labeled CT volumes, this foundation model supports the segmentation of over 200 anatomical categories using semantic and spatial prompts. To facilitate efficient and precise inference on volumetric images, we design a zoom-out-zoom-in mechanism. Extensive experiments on 22 anatomical segmentation tasks verify that SegVol outperforms the competitors in 19 tasks, with improvements up to 37.24% compared to the runner-up methods. We demonstrate the effectiveness and importance of specific designs by ablation study. We expect this foundation model can promote the development of volumetric medical image analysis. The model and code are publicly available at: https://github.com/BAAI-DCAI/SegVol.

Paper Structure (36 sections, 4 equations, 18 figures, 11 tables, 1 algorithm)

Figures (18)

  • Figure 1: Overview of SegVol model architecture. SegVol produces precise segmentation of 3D anatomical structures from volumetric inputs with easy user interactions, including point, bounding box, and text prompts. Zoom-out-zoom-in mechanism: SegVol initially produces a rough prediction mask with zoom-out inference, then refines it with zoom-in inference on the identified ROI.
  • Figure 2: Violin plots for quantitative comparison experiment results of SegVol and SAM-like interactive methods (kirillov2023segment, cheng2023sam, wang2023sammed3d, ma2023segment). The vertical axis represents the Dice score.
  • Figure 3: (a) The performance of SegVol improves as the training data scales up. (b) Quantitative results on 19 anatomical segmentation tasks, evaluated on the 20% test split, show that combining semantic and spatial prompts achieves better performance.
  • Figure 4: Four cases showing that a semantic prompt can resolve the ambiguity of a spatial prompt and avoid multiple plausible outputs. Each image shows the segmentation result of SegVol using the spatial prompt (a point or bounding box) together with the semantic prompt (the caption below the image).
  • Figure 5: We identify the semantic categories of the spatial-prompt segmentation results. Each image shows the spatial-prompt and the mask prediction. The bar charts rank the top 8 semantic categories with the highest classification probabilities. The results show that SegVol is capable of identifying the anatomical category of the segmentation mask using spatial prompts.
  • ...and 13 more figures
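
The zoom-out-zoom-in mechanism described in the Figure 1 caption (a rough mask from zoom-out inference, then refinement on the identified ROI) can be illustrated with a minimal NumPy sketch. This is not the SegVol implementation: `toy_segmenter` is a hypothetical threshold-based stand-in for the model, and the `factor` and `margin` parameters, the block-average downsampling, and the nearest-neighbour upsampling are all assumptions chosen for illustration. `dice_score` computes the Dice overlap that Figure 2 uses as its vertical axis.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def toy_segmenter(volume, threshold=0.5):
    """Hypothetical stand-in for the model: voxel-wise threshold."""
    return volume > threshold

def downsample(volume, factor):
    """Block-average pooling; assumes every dimension is divisible by factor."""
    d, h, w = volume.shape
    blocks = volume.reshape(d // factor, factor, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3, 5))

def upsample(mask, factor):
    """Nearest-neighbour upsampling back to the original grid."""
    return mask.repeat(factor, axis=0).repeat(factor, axis=1).repeat(factor, axis=2)

def bounding_box(mask, margin, shape):
    """Bounding box of a non-empty mask, padded by margin and clipped to shape."""
    coords = np.argwhere(mask)
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))

def zoom_out_zoom_in(volume, segment=toy_segmenter, factor=4, margin=2):
    """Return (coarse, refined) masks for a 3D volume."""
    # Zoom-out: segment a downsampled copy to get a rough global mask.
    coarse = upsample(segment(downsample(volume, factor)), factor)
    if not coarse.any():
        return coarse, coarse.copy()
    # Zoom-in: re-segment the full-resolution ROI and fill it back in.
    roi = bounding_box(coarse, margin, volume.shape)
    refined = coarse.copy()
    refined[roi] = segment(volume[roi])
    return coarse, refined

# Toy volume: a bright box that does not align with the coarse grid.
volume = np.zeros((16, 16, 16))
volume[4:12, 4:12, 5:11] = 1.0
truth = volume > 0.5

coarse, refined = zoom_out_zoom_in(volume)
print("coarse Dice: ", dice_score(coarse, truth))
print("refined Dice:", dice_score(refined, truth))
```

In this toy setup the coarse pass overshoots the box boundary because it only sees block averages, while the zoom-in pass recovers it at full resolution. A real model would replace `toy_segmenter` with a prompted network forward pass, which this sketch does not attempt to reproduce.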