Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track
Feiyu Pan, Hao Fang, Runmin Cong, Wei Zhang, Xiankai Lu
TL;DR
The paper investigates zero-shot semi-supervised video object segmentation on the challenging MOSE and LVOS datasets using SAM 2, a promptable segmentation model with streaming memory. It describes SAM 2’s memory-augmented transformer architecture, in which frame embeddings are conditioned on past memories via a memory attention mechanism before a lightweight mask decoder predicts the mask. In experiments, SAM 2 achieves a mean $J\&F$ score of 73.89% on the validation set and 75.79% on the test set, ranking 4th in the 6th LSVOS Challenge VOS Track, and outperforms a finetuned Cutie baseline on the val set. The results underscore SAM 2’s strong zero-shot VOS capability and establish a solid memory-based, promptable baseline for future VOS research in streaming video settings.
Abstract
The Video Object Segmentation (VOS) task aims to segment a particular object instance throughout an entire video sequence, given only the object mask in the first frame. Recently, Segment Anything Model 2 (SAM 2) was proposed as a foundation model for promptable visual segmentation in images and videos. SAM 2 builds a data engine, which improves both model and data via user interaction, to collect the largest video segmentation dataset to date. The model is a simple transformer architecture with streaming memory for real-time video processing; trained on this data, it provides strong performance across a wide range of tasks. In this work, we evaluate the zero-shot performance of SAM 2 on the more challenging VOS datasets MOSE and LVOS. Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th in the 6th LSVOS Challenge VOS Track.
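The J&F score reported above is, in the usual VOS benchmark convention, the average of region similarity J (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The following is a minimal sketch of that averaging, not the challenge's official evaluation code. The function names `region_similarity` and `j_and_f` are illustrative, masks are assumed to be nested lists of 0/1, and the F values are taken as given rather than computed from boundaries.

```python
def region_similarity(pred, gt):
    """Jaccard index J between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores, f_scores):
    """Average of mean J and mean F; the F values are supplied by the caller."""
    j_mean = sum(j_scores) / len(j_scores)
    f_mean = sum(f_scores) / len(f_scores)
    return (j_mean + f_mean) / 2


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)

# Hypothetical per-object scores, used only to show the averaging step.
score = j_and_f([0.5, 0.7], [0.6, 0.8])
```

In an actual evaluation, J and F would be computed per object and per frame and then averaged over the dataset; the toy values here only illustrate how the two measures combine into a single number.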
