Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track

Feiyu Pan, Hao Fang, Runmin Cong, Wei Zhang, Xiankai Lu

TL;DR

The paper investigates zero-shot semi-supervised video object segmentation on the challenging MOSE and LVOS datasets using SAM 2, a promptable segmentation model with streaming memory. It describes SAM 2's memory-augmented transformer architecture, in which a memory attention mechanism conditions frame embeddings on past memories before a lightweight mask decoder predicts the mask. In experiments, SAM 2 achieves a mean $J$&$F$ score of 73.89% on the validation set and 75.79% on the test set, ranking 4th in the 6th LSVOS Challenge VOS Track, and outperforms a fine-tuned Cutie baseline on the validation set. The results underscore SAM 2's strong zero-shot VOS capability and establish a solid memory-based, promptable baseline for future VOS research in streaming video settings.

Abstract

The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout an entire video sequence given only the object mask of the first frame. Recently, Segment Anything Model 2 (SAM 2) was proposed as a foundation model for promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves the model and the data via user interaction, to collect the largest video segmentation dataset to date. SAM 2 is a simple transformer architecture with streaming memory for real-time video processing, and, trained on this data, it provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.
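
In the standard VOS evaluation protocol, J&F is the mean of region similarity J (the Jaccard index, or IoU, between the predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The following is a minimal sketch of J and of the final averaging, assuming binary NumPy masks. The function names are illustrative and not taken from the paper, and F is passed in as a given number because its boundary matching is more involved.

```python
import numpy as np

def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty: treated here as perfect agreement (an assumption).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def j_and_f(j, f):
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0

# Toy masks: the intersection has 2 pixels and the union has 4 pixels.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 0, 0],
               [0, 1, 1]])
j = region_similarity(pred, gt)
```

In a benchmark, J and F are computed per object and per frame and then averaged over the dataset; the sketch covers a single mask pair only.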

Paper Structure (7 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The SAM 2 architecture. For a given frame, the segmentation prediction is conditioned on the current prompt and/or on previously observed memories. Videos are processed in a streaming fashion with frames being consumed one at a time by the image encoder, and cross-attended to memories of the target object from previous frames. The mask decoder, which optionally also takes input prompts, predicts the segmentation mask for that frame. Finally, a memory encoder transforms the prediction and image encoder embeddings (not shown in the figure) for use in future frames.
  • Figure 2: Qualitative comparison of Cutie and SAM 2 on the validation set.
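
The streaming loop described in the Figure 1 caption can be illustrated with a toy sketch. This is not SAM 2's implementation: every function below (`encode_frame`, `memory_attention`, `segment_video`) is a hypothetical stand-in, with raw pixel features in place of the image encoder, a single scaled dot-product cross-attention in place of memory attention, a 0.5 threshold in place of the mask decoder, and a small FIFO bank in place of the memory encoder. It only shows the control flow: frames are consumed one at a time, cross-attended to memories of the target object, and each prediction is written back to memory.

```python
import numpy as np
from collections import deque

def encode_frame(frame):
    """Stand-in image encoder: one feature vector per pixel, shape (H*W, C)."""
    return frame.reshape(-1, frame.shape[-1]).astype(np.float64)

def memory_attention(queries, mem_keys, mem_values):
    """Cross-attend current-frame tokens to memory tokens (scaled dot-product)."""
    scores = queries @ mem_keys.T / np.sqrt(queries.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ mem_values  # (H*W, 1) soft mask read out of memory

def segment_video(frames, first_mask, max_memories=4):
    """Propagate the first-frame mask through the video, one frame at a time."""
    h, w = first_mask.shape
    memory = deque(maxlen=max_memories)  # FIFO bank of (features, mask) pairs
    memory.append((encode_frame(frames[0]),
                   first_mask.reshape(-1, 1).astype(np.float64)))
    masks = [first_mask.astype(bool)]
    for frame in frames[1:]:
        feats = encode_frame(frame)
        keys = np.concatenate([k for k, _ in memory])
        values = np.concatenate([v for _, v in memory])
        soft = memory_attention(feats, keys, values)
        mask = (soft > 0.5).reshape(h, w)  # stand-in mask decoder
        masks.append(mask)
        # Stand-in memory encoder: store features and prediction for later frames.
        memory.append((feats, mask.reshape(-1, 1).astype(np.float64)))
    return masks

def make_frame(mask):
    """Toy frame: object pixels have feature [5, 0], background pixels [0, 5]."""
    return np.where(mask[..., None], [5.0, 0.0], [0.0, 5.0])

# The object moves from the left column to the right column.
m0 = np.array([[True, False], [True, False]])
m1 = np.array([[False, True], [False, True]])
masks = segment_video([make_frame(m0), make_frame(m1)], m0)
```

In this toy setting the object is separable by appearance alone, so the memory readout recovers its new position in the second frame; real videos need learned features and a learned decoder.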