The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation

Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee

TL;DR

The paper argues that the jump from SAM2 to SAM3 is a fundamental paradigm shift from prompt-based geometric segmentation to concept-driven, multimodal segmentation. It details architectural changes, new training objectives, and the data requirements central to SAM3, explaining why SAM2 expertise does not transfer. By contrasting datasets, augmentation strategies, evaluation metrics, and failure modes, the authors establish SAM3 as a new class of segmentation foundation model with open-vocabulary and semantic grounding capabilities. The work highlights practical implications for data curation, benchmark design, and cross-domain deployment as the field moves toward a concept-driven segmentation era.

Abstract

This paper investigates the fundamental discontinuity between the two most recent Segment Anything Models: SAM2 and SAM3. We explain why expertise in the prompt-based segmentation of SAM2 does not transfer to the multimodal, concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) a Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting the spatial prompt semantics of SAM2 with the multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing the pure vision-temporal design of SAM2 versus the integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity handling via Mixture-of-Experts in SAM3; (3) Dataset and Annotation Differences, contrasting the SA-V video masks with the multimodal concept-annotated corpora of SAM3; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.

Paper Structure

This paper contains 25 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Timeline of Segment Anything models (SAM1 to SAM3) and their core capabilities. Solid boxes denote natively supported functionalities; dashed boxes denote capabilities that are unsupported or only achievable via external components.
  • Figure 2: (a) SAM2 architecture: a prompt-driven vision–temporal pipeline where segmentation depends on spatial prompts and memory retrieval across frames; (b) SAM3 architecture: a multimodal vision-language system with a new detector enabling open-vocabulary, concept-level segmentation; (c) SAM2 orchard workflow showing prompt-based apple segmentation without semantic understanding; and (d) SAM3 workflow demonstrating text-prompted, concept-aware segmentation, identifying all relevant apple instances through multimodal fusion.
  • Figure 3: Compact comparison of SAM2 and SAM3 segmentation workflows. SAM2 follows a purely vision-based, spatially prompted pipeline, whereas SAM3 integrates text prompts, example images, and vision-language fusion to produce open-vocabulary, concept-level masks.
  • Figure 4: Mindmap summarizing the scientific reasons why expertise in SAM2 does not transfer to SAM3. The gap arises from differences in prompting, architecture, training objectives, datasets, and evaluation metrics.
  • Figure 5: SAM3 architecture, highlighting new multimodal components in yellow, inherited SAM2 [ravi2024sam] modules in blue, and the Perception Encoder [bolya2025perception] in cyan. The model integrates vision, text, geometry, and exemplar prompts through a dual encoder–decoder transformer, enabling concept-level segmentation beyond SAM2’s purely prompt-driven capabilities. Image sourced from [carion2025sam].
  • ...and 3 more figures