SAM2 for Image and Video Segmentation: A Comprehensive Survey

Zhang Jiaxing, Tang Hao

TL;DR

This survey examines SAM2, an enhanced SAM variant, as a foundation-model-based approach for image and video segmentation. It analyzes SAM2’s architectural innovations (notably memory mechanisms) and its ability to deliver robust, real-time segmentation across static images and dynamic video, with a focus on cross-domain adaptation, medical imaging, and autonomous-driving contexts. The paper catalogues a wide spectrum of SAM- and SAM2-based methods, datasets (natural and medical), and evaluation metrics, and it discusses current challenges in domain adaptation, multimodal integration, and resource-efficient inference. It concludes with practical recommendations for fine-tuning, lightweight optimization, and broader multimodal interaction to unlock SAM2’s real-world impact.

Abstract

Despite significant advances in deep learning for image and video segmentation, existing models continue to face challenges in cross-domain adaptability and generalization. Image and video segmentation are fundamental tasks in computer vision with wide-ranging applications in healthcare, agriculture, industrial inspection, and autonomous driving. With the advent of large-scale foundation models, SAM2, an improved version of SAM (Segment Anything Model), has been optimized for segmentation tasks and demonstrates enhanced performance in complex scenarios. However, SAM2's adaptability and limitations in specific domains require further investigation. This paper systematically analyzes the application of SAM2 in image and video segmentation and evaluates its performance in various fields. We begin by introducing the foundational concepts of image segmentation, categorizing foundation models, and exploring the technical characteristics of SAM and SAM2. We then examine SAM2's applications in static image and video segmentation, emphasizing its performance in specialized areas such as medical imaging and the challenges of cross-domain adaptability. As part of our research, we reviewed over 200 related papers to provide a comprehensive analysis of the topic. Finally, the paper highlights the strengths and weaknesses of SAM2 in segmentation tasks, identifies the technical challenges it faces, and proposes future development directions. This review provides insights and practical recommendations for optimizing and applying SAM2 in real-world scenarios.

Paper Structure

This paper contains 45 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: This image illustrates the evolution and categorization of SAM (Segment Anything Model), SAM2, and their derivatives. Different colors and positions represent each model along the timeline. In addition to compiling various SAM/SAM2 variants in the field of segmentation, including variants for tasks such as shadow detection and classification, we place particular emphasis on the progression of segmentation tasks.
  • Figure 2: This image illustrates the hierarchical structure of visual models, progressing from foundational visual models to general segmentation models, and then to specialized segmentation models, reflecting an increasing level of specialization to meet the needs of more specific visual tasks.
  • Figure 3: Segment Anything Model (SAM)
  • Figure 4: SAM2