Automatic Segmentation of 3D CT scans with SAM2 using a zero-shot approach

Miquel Lopez Escoriza, Pau Amargant Alvarez

Abstract

Foundation models for image segmentation have shown strong generalization in natural images, yet their applicability to 3D medical imaging remains limited. In this work, we study the zero-shot use of Segment Anything Model 2 (SAM2) for automatic segmentation of volumetric CT data, without any fine-tuning or domain-specific training. We analyze how SAM2 should be applied to CT volumes and identify its main limitation: the lack of inherent volumetric awareness. To address this, we propose a set of inference-only architectural and procedural modifications that adapt SAM2's video-based memory mechanism to 3D data by treating CT slices as ordered sequences. We conduct a systematic ablation study on a subset of 500 CT scans from the TotalSegmentator dataset to evaluate prompt strategies, memory propagation schemes, and multi-pass refinement. Based on these findings, we select the best-performing configuration and report final results on a larger sample of the TotalSegmentator dataset comprising 2,500 CT scans. Our results show that, even with frozen weights, SAM2 can produce coherent 3D segmentations when its inference pipeline is carefully structured, demonstrating the feasibility of a fully zero-shot approach for volumetric medical image segmentation.

Paper Structure (19 sections, 2 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Adaptation of SAM2 to 3D CT volumes. A CT scan is interpreted as a pseudo-video by treating the depth ($z$) axis as the temporal dimension. At each slice $t$, the input image is processed by the image encoder to produce an embedding. Embeddings from prompted slices (the first and last frames in the figure) and a fixed number of neighboring slices are stored in the memory bank and augmented with learned temporal embeddings. Prompt-conditioned frames are assigned a zero temporal offset, while non-conditioned frames use learned temporal embeddings. The memory attention module attends to a fixed window of six frames, producing a fused representation that is passed to the mask decoder to generate the segmentation mask for slice $t$.
  • Figure 2: Multi-axis propagation and fusion strategy. A 3D CT volume is segmented independently along the axial, sagittal, and coronal axes using SAM2. Each axis produces a logit volume; the three logit volumes are reoriented to a common reference frame and merged to obtain the final segmentation.
  • Figure 3: Ablation study on memory design and propagation strategies. (a) Effect of the prompt-conditioned memory threshold, showing improved performance when only nearby prompted slices are retained. (b) (i) Influence of non-conditioned memory size on performance, and (ii) cosine similarity of temporal embeddings, highlighting redundancy among intermediate past frames. (c) Per-category segmentation performance, illustrating variability across anatomical structures. (d) Performance comparison of different volume propagation strategies under varying prompt budgets, including Forward–Backward and Three-Axis variants.
  • Figure 4: Comparative runtime analysis across different configurations. Evaluation was performed on a set of 50 bones.
  • Figure 5: Vertebra segmentation produced by the baseline model, the proposed configuration (IS), and the ground truth. Each triplet corresponds to a slice of the CT volume. Vertebrae exhibit strong slice-to-slice appearance changes, making consistent tracking across the volume challenging. The proposed method produces more consistent segmentations across slices and successfully tracks the correct vertebra throughout the volume. While some predicted masks remain imprecise, the target structure is preserved. In contrast, the baseline model frequently segments incorrect objects (e.g., slices 100 and 111), likely due to confusion with visually similar structures in distant slices that are included in its memory.
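
The pseudo-video view in Figure 1 and the multi-axis fusion in Figure 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`segment_along_axis`, `fuse_three_axes`, `toy_slice_logits`) are hypothetical, the per-slice function is a toy stand-in for SAM2 that ignores memory and prompts, and the merge rule (averaging logits, then thresholding at zero) is an assumption, since the captions say only that the logit volumes are "merged".

```python
import numpy as np


def segment_along_axis(volume, axis, slice_logit_fn):
    """Treat `axis` as the pseudo-temporal dimension and return a logit volume
    reoriented to the original (common) reference frame."""
    # Move the chosen axis to the front so the volume reads as frames (T, H, W).
    frames = np.moveaxis(volume, axis, 0)
    # One logit map per frame; a real pipeline would propagate memory here.
    logits = np.stack([slice_logit_fn(frame) for frame in frames], axis=0)
    # Undo the axis move so every logit volume shares the same orientation.
    return np.moveaxis(logits, 0, axis)


def fuse_three_axes(volume, slice_logit_fn):
    """Segment along all three array axes and merge the logit volumes.

    Assumption: logits are averaged and thresholded at 0. Which array axis is
    axial, sagittal, or coronal depends on how the scan was loaded.
    """
    per_axis = [segment_along_axis(volume, axis, slice_logit_fn) for axis in range(3)]
    mean_logits = np.mean(per_axis, axis=0)
    return mean_logits > 0.0


def toy_slice_logits(frame):
    """Hypothetical stand-in for a per-slice model: positive logit where the
    normalized intensity exceeds 0.5."""
    return frame.astype(np.float32) - 0.5
```

As a usage example, calling `fuse_three_axes(volume, toy_slice_logits)` on a 3D array returns a boolean mask with the same shape as `volume`. Replacing the toy function with a real per-slice predictor would leave the reorientation and merging logic unchanged.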