Segment Anything for Videos: A Systematic Survey

Chunhui Zhang, Yawen Cui, Weilin Lin, Guanjie Huang, Yan Rong, Li Liu, Shiguang Shan

TL;DR

Segment Anything for Videos provides the first systematic survey of SAM and SAM 2 in the video domain, addressing the gap in video-centric reviews. It offers a structured taxonomy across three broad areas—video understanding, video generation, and video editing—and benchmarks representative methods against SOTA on standard datasets, highlighting both strengths (zero-shot generalization, promptability) and limitations (temporal coherence, domain adaptation). The work synthesizes current progress, compares SAM-based approaches to specialized video methods, and draws practical insights for selecting baselines and designing future studies. It also articulates concrete directions for scaling data and models, improving training efficiency, enriching modalities, and ensuring credible, interpretable video foundation models.

Abstract

The recent wave of foundation models has achieved tremendous success in computer vision (CV) and beyond, with the segment anything model (SAM) having sparked a passion for exploring task-agnostic visual foundation models. Empowered by its remarkable zero-shot generalization, SAM is currently challenging numerous traditional paradigms in CV, delivering extraordinary performance not only in various image segmentation and multi-modal segmentation (e.g., text-to-mask) tasks, but also in the video domain. Additionally, the recently released SAM 2 is once again sparking research enthusiasm in the realm of promptable visual segmentation for both images and videos. However, existing surveys mainly focus on SAM in various image processing tasks, and a comprehensive and in-depth review in the video domain is notably absent. To address this gap, this work conducts a systematic review of SAM for videos in the era of foundation models. As the first to review the progress of SAM for videos, this work focuses on its applications to various tasks, discussing recent advances and the opportunities for developing foundation models for broad applications. We begin with a brief introduction to the background of SAM and video-related research domains. Subsequently, we present a systematic taxonomy that categorizes existing methods into three key areas: video understanding, video generation, and video editing, analyzing and summarizing their advantages and limitations. Furthermore, we offer comparative results of SAM-based and current state-of-the-art methods on representative benchmarks, together with an analysis of those results. Finally, we discuss the challenges faced by current research and envision several future research directions in the field of SAM for video and beyond.

Paper Structure

This paper contains 36 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Summary of SAM-based works. (a) The number of SAM-related research works is rapidly increasing. (b) Video understanding dominates the research on SAM for videos.
  • Figure 2: Overall architectures of SAM (a) and SAM 2 (b) from the original papers ICCV2023SAM and ravi2024sam2, respectively. Given user prompts, SAM and SAM 2 perform interactive segmentation in images and videos. Several representative research routes for the SAM and SAM 2 models (e.g., model compression zhao2023fast, model robustness zhang2023attack), prompt (e.g., efficient finetuning chen2023sam), and outputs (e.g., innovative applications zhang2023segment, lai2023detect) are listed in (c).
  • Figure 3: Taxonomy of research works on SAM for videos. Due to space constraints, we list only some representative methods for each video-related task here.
  • Figure 4: Conceptual comparison of four prevalent visual segmentation tasks: semantic, instance, panoptic, and entity segmentation. (a) In semantic segmentation, regions with the same texture or category are assigned the same class label. (b) Instance segmentation focuses only on the foreground, and different objects in the same category are assigned different instance identities. (c) In panoptic segmentation, each pixel is assigned a semantic label and a unique instance identifier. (d) Entity segmentation qi2023high requires segmenting categories unseen in the training set, e.g., "tyre".
  • Figure 5: Examples of (a) video mask generation results with AVISeg guo2023audio and (b) video editing results with 2SVE wu2023cvpr.