Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation

Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, Si Liu

TL;DR

This work addresses training-free multimodal video object segmentation for audio- and language-referenced tasks (AVS and RVOS) by introducing AL-Ref-SAM 2. It uses GroundingDINO for initial object grounding and SAM 2 for segmentation, and compensates for the limited video context of that combination with two modules: a GPT-4-guided two-step Pivot Selection (GPT-PS) module for temporal-spatial reasoning, and a Language-Binded Reference Unification (LBRU) module that converts audio into language-formatted references. The approach unifies AVS and RVOS within a single pipeline and achieves performance comparable or superior to supervised methods on standard benchmarks, supported by ablations and qualitative results. The work highlights the potential of combining foundation models with carefully designed prompting strategies for training-free multimodal VOS. The accompanying code is publicly available on GitHub.

Abstract

In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio- and language-referenced video object segmentation, namely the AVS and RVOS tasks. The intuitive solution uses GroundingDINO to identify the target object in a single frame and SAM 2 to segment the identified object throughout the video, but this is less robust to spatiotemporal variations because it does not exploit video context. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module that instructs GPT-4 to perform two-step temporal-spatial reasoning, sequentially selecting pivot frames and pivot boxes and thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of the AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performance comparable to or even better than fully supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
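
The control flow described in the abstract (pick a pivot frame, then a pivot box, then prompt the segmenter) can be sketched as follows. This is a minimal illustration and not the authors' implementation: `select_pivot`, `segment_video`, and the four callables are hypothetical names introduced here. In the real pipeline the two selection callables would wrap GPT-4 prompts and `propagate` would wrap SAM 2. That `propose_boxes` stands in for a detector such as GroundingDINO is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical convention


@dataclass
class PivotSelection:
    frame_index: int
    box: Box


def select_pivot(
    frames: Sequence[object],
    reference: str,
    pick_frame: Callable[[Sequence[object], str], int],
    propose_boxes: Callable[[object, str], List[Box]],
    pick_box: Callable[[object, List[Box], str], int],
) -> PivotSelection:
    """Two-step selection: first a pivot frame, then a pivot box inside it."""
    if not frames:
        raise ValueError("frames must not be empty")

    # Step 1 (temporal): choose the frame that best matches the reference.
    frame_index = pick_frame(frames, reference)
    if not 0 <= frame_index < len(frames):
        raise ValueError("pick_frame returned an out-of-range index")

    # Step 2 (spatial): choose one candidate box within the pivot frame.
    pivot_frame = frames[frame_index]
    candidates = propose_boxes(pivot_frame, reference)
    if not candidates:
        raise ValueError("no candidate boxes in the pivot frame")
    box_index = pick_box(pivot_frame, candidates, reference)
    if not 0 <= box_index < len(candidates):
        raise ValueError("pick_box returned an out-of-range index")

    return PivotSelection(frame_index, candidates[box_index])


def segment_video(
    frames: Sequence[object],
    reference: str,
    pick_frame: Callable[[Sequence[object], str], int],
    propose_boxes: Callable[[object, str], List[Box]],
    pick_box: Callable[[object, List[Box], str], int],
    propagate: Callable[[Sequence[object], int, Box], List[object]],
) -> List[object]:
    """Prompt the segmenter with the pivot box and return one mask per frame."""
    pivot = select_pivot(frames, reference, pick_frame, propose_boxes, pick_box)
    return propagate(frames, pivot.frame_index, pivot.box)
```

Passing the models in as callables keeps the sketch independent of any particular GPT-4, GroundingDINO, or SAM 2 client API, none of which is specified in this summary.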

Paper Structure (21 sections, 1 equation, 7 figures, 6 tables)

Figures (7)

  • Figure 1: (a) For the training-free baseline, naively choosing the first frame to generate the object prompt may yield completely wrong predictions on the video since the first frame does not contain any relevant object. (b) Our method leverages GPT-4 to perform two-step temporal-spatial reasoning, selecting the frame and box that best reflect the reference information. The selected box serves as a more accurate object prompt to SAM 2 for better segmentation results.
  • Figure 2: The overall pipeline of our proposed Audio-Language-Referenced SAM 2. (a) Generate a language-formatted reference that specifies the objects to be segmented for both RVOS and AVS tasks. (b) Select the pivot frame and pivot box through two-step temporal-spatial reasoning. (c) Prompt SAM 2 with the selected pivot box to obtain the mask sequence of the target object across the entire video. The symbol $\times$ on $\mathbf{M}_{\rm{AVS}}$ represents the mask to be filtered out where the sound-emitting object is silent. Different colors are used to denote the data flow of RVOS and AVS tasks respectively.
  • Figure 3: Detailed illustration of our Language-Binded Reference Unification module.
  • Figure 4: Qualitative comparison between our method and the baseline GD-SAM 2 on the Ref-YouTube-VOS dataset.
  • Figure 5: Detailed prompts used in our AL-Ref-SAM 2 pipeline.
  • ...and 2 more figures
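
The Figure 2 caption mentions that, for AVS, masks are filtered out on frames where the sound-emitting object is silent. A minimal sketch of that filtering step is given below, assuming a per-frame boolean flag is already available. How the pipeline decides that a frame is silent is not described in this summary, so `is_sounding` and the function name `filter_silent_frames` are hypothetical.

```python
from typing import List, Sequence

Mask = List[List[int]]  # binary mask as nested lists, for illustration only


def filter_silent_frames(
    masks: Sequence[Mask], is_sounding: Sequence[bool]
) -> List[Mask]:
    """Blank out the mask of every frame in which the target object is silent."""
    if len(masks) != len(is_sounding):
        raise ValueError("need exactly one sounding flag per mask")

    filtered: List[Mask] = []
    for mask, sounding in zip(masks, is_sounding):
        if sounding:
            # Keep the predicted mask (copied so the input is not aliased).
            filtered.append([row[:] for row in mask])
        else:
            # Replace the mask with an all-zero mask of the same shape.
            filtered.append([[0] * len(row) for row in mask])
    return filtered
```

For example, with two 2x2 masks and flags `[True, False]`, the first mask is returned unchanged and the second becomes all zeros.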