Towards a Comprehensive, Efficient and Promptable Anatomic Structure Segmentation Model using 3D Whole-body CT Scans

Heng Guo, Jianfeng Zhang, Jiaxing Huang, Tony C. W. Mok, Dazhou Guo, Ke Yan, Le Lu, Dakai Jin, Minfeng Xu

TL;DR

This work tackles the challenge of extending segment-anything style models to 3D medical segmentation by introducing CT-SAM3D, a fully 3D promptable model trained from scratch on an expanded whole-body CT dataset (TotalSeg++). It introduces two core innovations: the progressively and spatially aligned prompt (PSAP), which encodes 3D click prompts efficiently, and the cross-patch prompt (CPP), which supplies context across patches. Together they enable scalable interactive segmentation across the 107 whole-body anatomies in the training data. CT-SAM3D achieves state-of-the-art performance among SAM-derived models on internal and external CT datasets, shows strong zero-shot tumor segmentation capability, and ships with a quasi-real-time 3D interactive tool. The combination of a comprehensively labeled dataset, 3D prompt encoding, and interactive tooling promises practical clinical impact: rapid, accurate whole-body CT segmentation with a reduced annotation workload.

Abstract

The Segment Anything Model (SAM) demonstrates strong generalization ability on natural image segmentation. However, directly applying it to medical image segmentation tasks leads to significant performance drops, and it requires an excessive number of prompt points to reach reasonable accuracy. Although quite a few studies explore adapting SAM to medical image volumes, the efficiency of 2D adaptation methods is unsatisfactory, and 3D adaptation methods can only segment specific organs or tumors. In this work, we propose a comprehensive and scalable 3D SAM model for whole-body CT segmentation, named CT-SAM3D. Instead of adapting SAM, we train a 3D promptable segmentation model on a (nearly) fully labeled CT dataset. Training CT-SAM3D effectively requires the model to respond accurately to higher-dimensional spatial prompts, and GPU memory constraints make 3D patch-wise training necessary. We therefore propose two key technical developments: 1) a progressively and spatially aligned prompt encoding method that effectively encodes click prompts in local 3D space; and 2) a cross-patch prompt scheme that captures more 3D spatial context, which reduces the editing workload when interactively prompting on large organs. CT-SAM3D is trained on a curated dataset of 1204 CT scans containing 107 whole-body anatomies and extensively validated on five datasets, achieving significantly better results than all previous SAM-derived models. Code, data, and our 3D interactive segmentation tool with quasi-real-time responses are available at https://github.com/alibaba-damo-academy/ct-sam3d.
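The abstract notes that training is patch-wise and that click prompts are encoded in local 3D space. As a minimal sketch of what that implies for coordinates (not the authors' implementation), the hypothetical helper below crops a fixed-size patch around a click and re-expresses the click relative to the patch origin. The function name, the `(z, y, x)` axis order, and the default patch size of 64 voxels are assumptions made for illustration only.

```python
import numpy as np

def crop_patch_around_click(volume, click_zyx, patch_size=(64, 64, 64)):
    """Crop a fixed-size patch centred on a click, as far as bounds allow.

    Hypothetical helper for illustration. Assumes the click lies inside the
    volume. Returns the patch, the patch origin in volume coordinates, and
    the click position in patch-local coordinates.
    """
    shape = np.array(volume.shape)
    size = np.array(patch_size)
    if np.any(size > shape):
        raise ValueError("patch_size must fit inside the volume")
    click = np.array(click_zyx)
    # Centre the patch on the click, then clamp it so it stays in the volume.
    origin = np.clip(click - size // 2, 0, shape - size)
    z, y, x = origin
    dz, dy, dx = size
    patch = volume[z:z + dz, y:y + dy, x:x + dx]
    # The same voxel, addressed relative to the patch instead of the volume.
    local_click = click - origin
    return patch, origin, local_click
```

Near a volume border the patch is shifted rather than padded, so the click is no longer at the patch centre; the returned local coordinates account for that shift.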

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 9 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Illustration of the enhanced TotalSeg++ dataset and the versatile 3D promptable CT-SAM3D model. TotalSeg++ complements the TotalSeg dataset with added skeletal muscle, visceral fat, and subcutaneous fat annotations.
  • Figure 2: (A) Framework of CT-SAM3D. (B) Details of the progressively and spatially aligned prompt. (C) Cross-patch prompt training scheme. (D) Inference on large organs via cross-patch prompting on $\mathbf{N}_j (j\in [1, 26])$, the 26 nearest neighbors of the selected patch.
  • Figure 3: Grouped boxplot of different methods. "CT-SAM3D*" (lemon color) denotes degraded results when trained on TotalSeg. The $p$-values are presented above the boxes.
  • Figure 4: Results under increasing number of clicks on FLARE22.
  • Figure 5: Qualitative results of different methods on a subject with severe renal pathology (green region). The first row is an axial slice, the second row is a coronal slice, and the last row shows the 3D volume rendering. DSC (%) scores are reported for each method.
  • ...and 5 more figures
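
The caption of Figure 2(D) refers to the 26 nearest neighbors $\mathbf{N}_j$ of a selected patch. The count follows from a 3 × 3 × 3 block of patches minus the centre one. The sketch below enumerates those neighbors under the assumption of a regular, non-overlapping patch grid; the paper's actual stride and overlap are not stated in this summary, and the function name is hypothetical.

```python
from itertools import product

def neighbor_patch_origins(origin, patch_size, volume_shape):
    """Origins of the (up to 26) patches adjacent to the patch at `origin`.

    Assumes non-overlapping patches on a regular grid; neighbors that would
    extend past the volume bounds are dropped.
    """
    neighbors = []
    for offset in product((-1, 0, 1), repeat=3):
        if offset == (0, 0, 0):
            continue  # skip the selected patch itself
        candidate = tuple(o + d * p for o, d, p in zip(origin, offset, patch_size))
        inside = all(
            0 <= c and c + p <= s
            for c, p, s in zip(candidate, patch_size, volume_shape)
        )
        if inside:
            neighbors.append(candidate)
    return neighbors
```

A patch in the interior of the volume has all 26 neighbors, while a patch in a corner keeps only the 7 that fit inside the volume.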