VideoSAM: Open-World Video Segmentation

Pinxue Guo, Zixu Zhao, Jianxiong Gao, Chongruo Wu, Tong He, Zheng Zhang, Tianjun Xiao, Wenqiang Zhang

TL;DR

VideoSAM is an end-to-end framework for open-world video segmentation. It improves object tracking and segmentation consistency in dynamic environments, and it incorporates an autoregressive object-token mechanism within the SAM decoder to keep segmentation granularity consistent across frames.

Abstract

Video segmentation is essential for advancing robotics and autonomous driving, particularly in open-world settings where continuous perception and object association across video frames are critical. While the Segment Anything Model (SAM) has excelled in static image segmentation, extending its capabilities to video segmentation poses significant challenges. We tackle two major hurdles: a) SAM's embedding limitations in associating objects across frames, and b) granularity inconsistencies in object segmentation. To this end, we introduce VideoSAM, an end-to-end framework designed to address these challenges by improving object tracking and segmentation consistency in dynamic environments. VideoSAM integrates an agglomerated backbone, RADIO, which enables object association through similarity metrics, and introduces Cycle-ack-Pairs Propagation with a memory mechanism for stable object tracking. Additionally, we incorporate an autoregressive object-token mechanism within the SAM decoder to maintain consistent granularity across frames. Our method is extensively evaluated on the UVO and BURST benchmarks and on robotic videos from RoboTAP, demonstrating its effectiveness and robustness in real-world scenarios. All code will be made available.
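The abstract does not spell out how Cycle-ack-Pairs Propagation works, so the following is not the authors' algorithm. It is a minimal sketch of a generic forward-backward (mutual nearest-neighbor) check, one common way to keep only associations that agree in both directions. The function names, the per-object embedding vectors, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def cycle_consistent_pairs(prev_embs, curr_embs):
    """Return (i, j) pairs where previous-frame object i and current-frame
    object j are each other's most similar match.

    Hypothetical helper: a generic mutual nearest-neighbor filter, not the
    paper's Cycle-ack-Pairs Propagation.
    """
    if not prev_embs or not curr_embs:
        return []
    sim = [[cosine(p, c) for c in curr_embs] for p in prev_embs]
    n_prev, n_curr = len(prev_embs), len(curr_embs)
    # Forward pass: best current-frame match for each previous-frame object.
    fwd = [max(range(n_curr), key=lambda j: sim[i][j]) for i in range(n_prev)]
    # Backward pass: best previous-frame match for each current-frame object.
    bwd = [max(range(n_prev), key=lambda i: sim[i][j]) for j in range(n_curr)]
    # Keep a pair only if the backward match returns to where it started.
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
```

Objects whose forward match does not point back to them are left unpaired, which is where a memory of earlier frames could plausibly be consulted; how VideoSAM actually handles that case is not described in the abstract.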

Paper Structure

This paper contains 16 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: VideoSAM produces open-world segmentation on videos with consistent object granularity. One color indicates one object.
  • Figure 2: Limitations of the Segment Anything Model (SAM) in extending to the open-world video segmentation task. (a) We conduct oracle experiments using ground-truth masks for each frame to associate objects across frames, utilizing DINOv2/SAM embeddings pooled from the masks on the OVIS dataset [qi2022occluded]. We also perform linear probing experiments on ImageNet [deng2009imagenet]. These experiments demonstrate that while SAM embeddings are powerful for static image segmentation, they lack association and semantic information. (b) SAM exhibits inconsistent granularity when detecting objects across different frames, e.g., woman and cat.
  • Figure 3: Overview of VideoSAM. It simultaneously employs DINOv2 and SAM embeddings from an efficient agglomerated encoder to handle object association and segmentation, respectively. Cycle-ack Pair Propagation is introduced to robustly associate objects across frames. AR-SAM Decoder, adapted from the mask decoder of SAM with temporal autoregressive object prompts, is used to maintain consistent segmentation granularity across video frames.
  • Figure 4: Qualitative results of VideoSAM compared with the SAM baseline and DEVA [cheng2023tracking] on the UVO [wang2021unidentified] and BURST [athar2023burst] datasets. The baseline prompts SAM with points propagated by feature similarity. Overall, VideoSAM reliably tracks objects and generates object masks with consistent granularity.
  • Figure 5: Qualitative performance on the RoboTAP [vecerik2024robotap] dataset.
  • ...and 1 more figures
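
The oracle experiment in Figure 2(a) pools an embedding over each object's mask and then associates objects across frames by similarity. A minimal, dependency-free sketch of that idea follows. The function names `mask_pool` and `associate`, the nested-list feature layout (height x width x channels), average pooling, and cosine similarity are assumptions made for illustration, not details confirmed by the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mask_pool(features, mask):
    """Average the feature vectors at every position where the mask is set.

    features: H x W x C nested lists (a per-pixel or per-patch embedding map).
    mask:     H x W nested lists of 0/1 for one object.
    Returns a length-C list; all zeros if the mask is empty.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total
    return [t / count for t in total]


def associate(prev_objs, curr_objs):
    """For each current-frame object embedding, return the index of the most
    similar previous-frame object embedding (hypothetical helper)."""
    return [
        max(range(len(prev_objs)), key=lambda i: cosine(prev_objs[i], emb))
        for emb in curr_objs
    ]
```

With ground-truth masks supplied for every frame, the fraction of objects that `associate` maps to the correct identity gives a rough measure of how much association information an embedding carries, which is the kind of comparison the caption describes between DINOv2 and SAM embeddings.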