Training-Free Robust Interactive Video Object Segmentation

Xiaoli Wei, Zhaoqing Wang, Yandong Guo, Chunxia Zhang, Tongliang Liu, Mingming Gong

TL;DR

The paper addresses interactive video object segmentation (IVOS) across diverse domains by removing the need for task-specific training. It introduces I-PT, a training-free framework that combines prompt-based interaction with the Segment Anything Model (SAM) and a Cross-Round Space-Time Module (CRSTM) to propagate memory across rounds and frames. By jointly tracking query points and boxes and using CRSTM for adaptive memory readout, I-PT achieves robust zero-shot segmentation on DAVIS 2017, YouTube-VOS 2018, and MOSE 2023 while maintaining a favorable interaction time. The approach demonstrates that SAM-based IVOS can generalize effectively without training, offering a scalable foundational framework for interactive video annotation and data labeling.

Abstract

Interactive video object segmentation is a crucial video task with applications ranging from video editing to data annotation. However, current approaches struggle to accurately segment objects across diverse domains. Recently, the Segment Anything Model (SAM) introduced interactive visual prompts and demonstrated impressive performance across different domains. In this paper, we propose a training-free prompt tracking framework for interactive video object segmentation (I-PT), leveraging the powerful generalization of SAM. Although point tracking efficiently captures the pixel-wise information of objects in a video, points tend to be unstable when tracked over a long period, resulting in incorrect segmentation. For fast and robust interaction, we jointly track sparse points and boxes, filtering out unstable points and capturing object-wise information. To better integrate reference information from multiple interactions, we introduce a cross-round space-time module (CRSTM), which adaptively aggregates mask features from previous rounds and frames, enhancing segmentation stability. Our framework demonstrates robust zero-shot video segmentation results on popular VOS datasets with different interaction types, including DAVIS 2017, YouTube-VOS 2018, and MOSE 2023, while maintaining a good tradeoff between performance and interaction time.
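
The abstract says that unstable points are filtered out with the help of box tracking, but it does not give the criterion. The following is a minimal sketch under one plausible assumption: a tracked positive point is kept only if the point tracker still reports it as visible and it lies inside the tracked object box. The function name `filter_points`, the `visible` flags, and the `margin` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def filter_points(points, visible, box, margin=0.0):
    """Keep tracked positive points that are visible and inside the tracked box.

    Assumed rule for illustration only; the paper's exact criterion may differ.

    points:  (K, 2) array of (x, y) positions in the current frame.
    visible: (K,) boolean flags, assumed to come from a point tracker.
    box:     (x_min, y_min, x_max, y_max) of the tracked object box.
    margin:  optional slack in pixels added around the box.
    """
    points = np.asarray(points, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    x_min, y_min, x_max, y_max = box

    # A point is "inside" if both coordinates fall within the padded box.
    inside = (
        (points[:, 0] >= x_min - margin)
        & (points[:, 0] <= x_max + margin)
        & (points[:, 1] >= y_min - margin)
        & (points[:, 1] <= y_max + margin)
    )

    keep = visible & inside
    return points[keep], keep


if __name__ == "__main__":
    pts = [[10, 10], [50, 40], [200, 30], [30, 35]]
    vis = [True, True, True, False]
    kept, mask = filter_points(pts, vis, box=(0, 0, 100, 100))
    print(kept)  # the surviving points would be passed to SAM as prompts
    print(mask)
```

In this example the third point has drifted outside the box and the fourth is occluded, so only the first two would be kept as prompts for the current frame.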

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Query point and box selection. Positive points are represented by circles, negative points by crosses, and different colors signify different objects. The edges of the target object are visualized to show where visual prompts are selected.
  • Figure 2: (a) The proposed I-PT framework. The video sequence has length $N$, CRSTM stores information across frames at an interval of $d$, and the interactive frames are denoted by $r$. (b) Multiple mask decoding iteration.
  • Figure 3: CRSTM architecture. It mainly contains three processes, i.e., memory updating, memory readout, and segmentation.
  • Figure 4: $\mathcal{J} \& \mathcal{F}$ versus interaction time and rounds. (a) Comparison of I-PT with existing trained IVOS methods on the DAVIS 2017 validation set (Pont-Tuset et al., 2017). (b) Performance differences across different prompts on YouTube-VOS 2018 (Xu et al., 2018). (c) Performance differences across various configurations of the multiple mask decoding iteration on the DAVIS 2017 validation set (Pont-Tuset et al., 2017).
  • Figure 5: Visualization of I-PT object segmentation after 3 interactive rounds, with white circles for tracked positive points, crosses for negative points, and colored boxes for distinct objects.
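
The captions for Figures 2 and 3 mention that CRSTM stores information at an interval of $d$ frames and involves memory updating and memory readout. The sketch below illustrates those two steps with a generic space-time memory and a softmax attention readout. It is not the authors' module: the class `MemoryBank`, the scaled dot-product affinity, and the rule that interactive frames are always stored are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MemoryBank:
    """Generic space-time memory (hypothetical stand-in for CRSTM)."""

    def __init__(self, interval):
        self.interval = interval  # d: store every d-th frame
        self.keys = []            # list of (P, C) key features
        self.values = []          # list of (P, Cv) mask features

    def maybe_update(self, frame_idx, key, value, is_interactive=False):
        """Store features for interactive frames and for every d-th frame."""
        if is_interactive or frame_idx % self.interval == 0:
            self.keys.append(key)
            self.values.append(value)
            return True
        return False

    def readout(self, query_key):
        """Aggregate stored values with attention weights.

        query_key: (Q, C) features of the current frame.
        Returns a (Q, Cv) array, one aggregated value per query position.
        """
        mem_k = np.concatenate(self.keys, axis=0)    # (M, C)
        mem_v = np.concatenate(self.values, axis=0)  # (M, Cv)
        # Scaled dot-product affinity between query and memory positions.
        affinity = query_key @ mem_k.T / np.sqrt(query_key.shape[1])
        weights = softmax(affinity, axis=1)          # each row sums to 1
        return weights @ mem_v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = MemoryBank(interval=5)
    for t in range(11):
        bank.maybe_update(t, rng.normal(size=(4, 8)), rng.normal(size=(4, 3)),
                          is_interactive=(t == 3))
    print(len(bank.keys))                                # number of stored frames
    print(bank.readout(rng.normal(size=(6, 8))).shape)   # readout shape
```

Because the attention weights are computed per query position, the readout can draw on different stored frames and rounds for different parts of the image, which is one simple way to realize the "adaptive" aggregation the abstract describes.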