A Short Review and Evaluation of SAM2's Performance in 3D CT Image Segmentation

Yufan He; Pengfei Guo; Yucheng Tang; Andriy Myronenko; Vishwesh Nath; Ziyue Xu; Dong Yang; Can Zhao; Daguang Xu; Wenqi Li

TL;DR

This paper evaluates Segment Anything 2 (SAM2) for zero-shot 3D CT segmentation and addresses the variability in prior benchmarks caused by different evaluation pipelines. It reproduces SAM2's eight-iteration interactive protocol on multiple 3D CT datasets and compares against established baselines such as VISTA3D, providing a standardized benchmarking framework and code. The study finds that SAM2 in zero-shot mode generates many false positives when foreground objects disappear and that adding more slices yields limited gains, with strong performance only for small, single-connected structures when background slices are removed. The authors conclude that zero-shot SAM2 is not yet satisfactory for 3D medical imaging and advocate finetuning or new 3D-aware approaches, delivering a reproducible protocol to guide future research.

Abstract

Since the release of Segment Anything 2 (SAM2), the medical imaging community has been actively evaluating its performance for 3D medical image segmentation. However, different studies have employed varying evaluation pipelines, resulting in conflicting outcomes that obscure a clear understanding of SAM2's capabilities and potential applications. We briefly review existing benchmarks and point out that the SAM2 paper clearly outlines a zero-shot evaluation pipeline, which simulates user clicks iteratively for up to eight iterations. We reproduced this interactive annotation simulation on 3D CT datasets and provide the results and code at https://github.com/Project-MONAI/VISTA. Our findings reveal that directly applying SAM2 to 3D medical imaging in a zero-shot manner is far from satisfactory. It is prone to generating false positives when foreground objects disappear, and annotating more slices cannot fully offset this tendency. For smaller single-connected objects such as the kidney and aorta, SAM2 performs reasonably well, but for most organs it is still far behind state-of-the-art 3D annotation methods. More research and innovation are needed for the 3D medical imaging community to use SAM2 correctly.
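The iterative click simulation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the linked repository). The click-placement rule used here (correct the larger error type first, then click the error pixel nearest the centroid of that error region) is an assumption made for illustration, and `next_click`, `simulate_clicks`, and `predict_fn` are hypothetical names; `predict_fn` stands in for any promptable model that maps a list of clicks to a binary mask.

```python
import numpy as np


def next_click(gt, pred):
    """Return one simulated corrective click as (row, col, label), or None.

    label is 1 (positive) for a missed foreground pixel and 0 (negative)
    for a false-positive pixel. The placement rule is an assumption of
    this sketch, not the rule from the SAM2 paper.
    """
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    false_neg = gt & ~pred
    false_pos = pred & ~gt
    if not false_neg.any() and not false_pos.any():
        return None  # prediction already matches the ground truth
    # Assumption: correct whichever error type covers more pixels.
    if false_neg.sum() >= false_pos.sum():
        region, label = false_neg, 1
    else:
        region, label = false_pos, 0
    coords = np.argwhere(region)
    centroid = coords.mean(axis=0)
    # The centroid may lie outside the region, so snap to the nearest error pixel.
    nearest = np.argmin(((coords - centroid) ** 2).sum(axis=1))
    row, col = coords[nearest]
    return int(row), int(col), label


def simulate_clicks(gt, predict_fn, max_iters=8):
    """Run up to max_iters rounds of click, predict, compare on one slice."""
    clicks = []
    pred = np.zeros(np.shape(gt), dtype=bool)  # start from an empty prediction
    for _ in range(max_iters):
        click = next_click(gt, pred)
        if click is None:
            break
        clicks.append(click)
        pred = np.asarray(predict_fn(clicks), dtype=bool)
    return clicks, pred
```

With a real model, `predict_fn` would wrap the model's prompt interface; the loop stops early once the prediction matches the ground truth, and otherwise ends after eight rounds, matching the iteration limit described above.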

Paper Structure (3 sections, 4 figures, 1 table)

This paper contains 3 sections, 4 figures, 1 table.

Introduction
Method
Results

Figures (4)

Figure 1: Example of segmenting the spleen using the SAM2 online demo. The first row of figures is generated by one whole-volume propagation with one annotated slice (N); the second row is generated by another propagation with two annotated slices (N-1-M and N). The video is converted from a full MSD spleen NIfTI file, with each slice treated as a frame. The initial click is on slice N, but on slice N-1 the liver region is segmented instead, so the segmentation is completely wrong from slice 0 to slice N-1. Meanwhile, slice N+K contains no spleen, and SAM2's tracking starts to segment the heart. Slice N-1-M is then selected to add negative points, which removes all false positives from slice 0 to slice N-1-M; however, those negative points have no effect on the false positives from the next slice, N-M, onward.
Figure 2: Mean Dice scores and 95% confidence intervals versus the number of annotated slices on MSD tasks.
Figure 3: Mean Dice scores and 95% confidence intervals versus the number of annotated slices on MSD task09 and AbdomenCT-1K.
Figure 4: Mean Dice scores and 95% confidence intervals versus the number of annotated slices on BTCV.
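For readers unfamiliar with the metric plotted in Figures 2 to 4, the following is a minimal sketch of a Dice score and a 95% confidence interval on its mean. The source does not state how the intervals were computed, so the normal approximation (mean ± 1.96 × standard error) is an assumption here, as is the convention of scoring two empty masks as 1.0; `dice_score` and `mean_with_ci95` are hypothetical helper names.

```python
import numpy as np


def dice_score(pred, gt):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)


def mean_with_ci95(scores):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    scores = np.asarray(scores, dtype=float)
    mean = float(scores.mean())
    if scores.size < 2:
        return mean, mean, mean  # no spread can be estimated from one case
    sem = scores.std(ddof=1) / np.sqrt(scores.size)  # standard error of the mean
    half_width = 1.96 * sem
    return mean, float(mean - half_width), float(mean + half_width)
```

Note that the Dice score drops to 0 whenever the model predicts foreground on a volume or slice whose ground truth is empty, which is how the false positives described in the abstract show up in this metric.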
