From SAM to SAM 2: Exploring Improvements in Meta's Segment Anything Model

Athulya Sundaresan Geetha, Muhammad Hussain

TL;DR

This work surveys the evolution from the Segment Anything Model (SAM) to SAM 2, detailing how SAM achieves zero-shot image segmentation through a Vision Transformer image encoder, a prompt encoder, and a Transformer-based mask decoder trained on the large-scale SA-1B dataset. It then presents SAM 2, which introduces memory-based components (Memory Attention and Memory Bank) to extend segmentation to video with temporal coherence, trained on the SA-V dataset. The authors compare SAM and SAM 2, outlining annotation workflows and performance gains, including faster frame-level processing and improved cross-frame consistency. The study demonstrates the viability of memory-augmented segmentation for real-time, large-scale video understanding and highlights avenues for further advances in motion modeling and automated data annotation.

Abstract

The Segment Anything Model (SAM), introduced to the computer vision community by Meta in April 2023, is a groundbreaking tool that allows automated segmentation of objects in images based on prompts such as text, clicks, or bounding boxes. SAM excels in zero-shot performance, segmenting unseen objects without additional training, a capability enabled by training on a large dataset of over one billion image masks. SAM 2 expands this functionality to video, leveraging memory from preceding and subsequent frames to generate accurate segmentation across entire videos at near real-time speed. This comparison shows how SAM has evolved to meet the growing need for precise and efficient segmentation in various applications. The study suggests that future advancements in models like SAM will be crucial for improving computer vision technology.

Paper Structure

This paper contains 30 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Segmentation example.
  • Figure 2: Architecture of the Segment Anything Model [rath2023segment].
  • Figure 3: Architecture of the Segment Anything Model 2 [RN9].
  • Figure 4: Image encoding [RN10].
  • Figure 5: Promptable Visual Segmentation [RN10].