Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

Ci-Siang Lin, Min-Hung Chen, I-Jieh Liu, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang

TL;DR

This paper addresses Referring Video Object Segmentation (RVOS) by decomposing the problem into referring, video, and segmentation factors and proposing Tenet, a temporal-prompt framework that adapts image-based foundation segmentation models to video-language segmentation. Tenet generates a reference proposal and multiple candidate tracks using off-the-shelf detectors and trackers, then uses a Prompt Preference Learning module to select the temporal prompt that best aligns with the referring expression and feeds it to a segmentation foundation model. The approach reports strong results on Ref-YouTube-VOS and Ref-DAVIS17 with a compact training footprint (~45M parameters) and improved efficiency versus end-to-end vision-language methods. Overall, the work shows that temporally aware prompts can effectively leverage foundation segmentation models for RVOS without requiring dense mask annotations, enabling scalable and adaptable video understanding with language grounding.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which can be computationally expensive and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts can be produced, they cannot be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By using such prompts to instruct image-based foundation segmentation models, we can produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.
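
The generate-select-segment flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `TemporalPrompt`, `select_prompt`, `segment_video`, and `toy_score` are hypothetical names, the scorer is a hard-coded placeholder standing in for a learned preference model, and the segmenter is a stub where an image-based foundation model would be prompted with one box per frame.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class TemporalPrompt:
    """One box per frame, following a single object through the video."""
    name: str
    boxes: List[Box]


def select_prompt(prompts: Sequence[TemporalPrompt],
                  score_fn: Callable[[TemporalPrompt, str], float],
                  expression: str) -> TemporalPrompt:
    """Return the prompt that score_fn rates highest for the expression."""
    if not prompts:
        raise ValueError("need at least one temporal prompt")
    return max(prompts, key=lambda p: score_fn(p, expression))


def segment_video(prompt: TemporalPrompt,
                  segment_frame: Callable[[int, Box], Any]) -> List[Any]:
    """Prompt a per-frame segmenter with the selected box in every frame."""
    return [segment_frame(i, box) for i, box in enumerate(prompt.boxes)]


def toy_score(prompt: TemporalPrompt, expression: str) -> float:
    # Placeholder scores (assumption): a trained preference model would
    # compare the prompt with the expression here.
    return {"reference": 0.4, "track_0": 0.9, "track_1": 0.2}[prompt.name]


# Hypothetical prompts: a reference proposal plus two candidate tracks.
prompts = [
    TemporalPrompt("reference", [(10, 10, 50, 50), (12, 10, 52, 50)]),
    TemporalPrompt("track_0", [(11, 9, 51, 49), (13, 9, 53, 49)]),
    TemporalPrompt("track_1", [(80, 80, 120, 120), (82, 80, 122, 120)]),
]

best = select_prompt(prompts, toy_score, "the dog on the left")
# Stub segmenter (assumption): echoes the frame index and box instead of a mask.
masks = segment_video(best, lambda i, box: {"frame": i, "box": box})
```

The point of the separation is that only the scoring step needs to be learned; detection, tracking, and segmentation are left to existing models.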

Paper Structure

This paper contains 23 sections, 1 equation, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Given the expression, we first generate temporal prompts as the reference proposal and candidate tracks. Our proposed Tenet framework then selects the one that best aligns with the expression to prompt SAM, achieving referring video object segmentation.
  • Figure 2: Qualitative results when using visual prompts derived from different methods to prompt SAM on the Ref-DAVIS17 dataset. Note that the score on the left is the box mIoU.
  • Figure 3: Overview of the proposed Tenet framework. We first produce the reference proposal and candidate tracks as described in Section \ref{sec:generation}, and then perform Prompt Preference Learning as detailed in Section \ref{sec:selection}.
  • Figure 4: Qualitative results on Ref-YouTube-VOS.
  • Figure 5: Qualitative results on Ref-DAVIS17.
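
The box mIoU score mentioned in the Figure 2 caption can be illustrated with a short sketch. It assumes that box mIoU means the per-frame intersection-over-union between a prompt's box and the ground-truth box, averaged over frames; the paper's exact definition is not given in this excerpt, and the function names `box_iou` and `box_miou` are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_miou(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Mean per-frame IoU between predicted and ground-truth boxes (assumed definition)."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have one box per frame")
    if not pred:
        raise ValueError("need at least one frame")
    return sum(box_iou(p, g) for p, g in zip(pred, gt)) / len(pred)


# Two frames: a perfect match, then a box shifted by one unit in x and y.
pred_boxes = [(0, 0, 2, 2), (0, 0, 2, 2)]
gt_boxes = [(0, 0, 2, 2), (1, 1, 3, 3)]
score = box_miou(pred_boxes, gt_boxes)  # (1 + 1/7) / 2 = 4/7
```

Under this definition, a prompt that drifts off the target in some frames is penalized even when its remaining boxes are tight.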