Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation
Shaofei Huang, Rui Ling, Hongyu Li, Tianrui Hui, Zongheng Tang, Xiaoming Wei, Jizhong Han, Si Liu
TL;DR
This work addresses training-free video object segmentation for audio- and language-referenced tasks (AVS and RVOS) by introducing AL-Ref-SAM 2. A straightforward pipeline grounds the target object with GroundingDINO on a single frame and segments it across the video with SAM 2, but it ignores video context. AL-Ref-SAM 2 addresses this limitation with a GPT-assisted Pivot Selection (GPT-PS) module, which prompts GPT-4 to perform two-step temporal-spatial reasoning, and a Language-Binded Reference Unification (LBRU) module, which converts audio signals into language-formatted references. The approach unifies AVS and RVOS within a single pipeline and, in the reported experiments, performs comparably to or better than fully-supervised fine-tuning methods on both tasks. The work highlights the potential of combining foundation models with carefully designed prompting strategies for training-free multimodal video object segmentation. The code is publicly available on GitHub.
Abstract
In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio- and language-referenced video object segmentation, namely the AVS and RVOS tasks. The intuitive solution leverages GroundingDINO to identify the target object in a single frame and SAM 2 to segment the identified object throughout the video, but this solution is less robust to spatiotemporal variations because it does not explore video context. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module that instructs GPT-4 to perform two-step temporal-spatial reasoning, sequentially selecting a pivot frame and a pivot box, thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific Chain-of-Thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding GPT to make selections based on a comprehensive understanding of the video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module to convert audio signals into language-formatted references, thereby unifying the formats of the AVS and RVOS tasks in the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performance comparable to or even better than that of fully-supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
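
The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released implementation: every name in it (AudioClip, Pivot, unify_reference, select_pivot, segment_video, and the callables audio_to_text, pick_frame, detect_boxes, pick_box, propagate) is hypothetical, and the model calls (GPT-4, GroundingDINO, SAM 2) are passed in as plain functions rather than invoked through their real APIs. It also assumes that the candidate boxes offered to the second reasoning step come from a detector such as GroundingDINO run on the pivot frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, Union

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); format is an assumption


@dataclass
class AudioClip:
    """Hypothetical container for an audio reference (AVS task)."""
    samples: Sequence[float]


@dataclass
class Pivot:
    """Initial object prompt: one frame index plus one box on that frame."""
    frame_index: int
    box: Box


Reference = Union[str, AudioClip]


def unify_reference(ref: Reference, audio_to_text: Callable[[AudioClip], str]) -> str:
    """LBRU-style step: text passes through, audio becomes a text reference."""
    if isinstance(ref, str):
        return ref
    return audio_to_text(ref)


def select_pivot(
    frames: Sequence[object],
    text: str,
    pick_frame: Callable[[Sequence[object], str], int],
    detect_boxes: Callable[[object, str], List[Box]],
    pick_box: Callable[[object, List[Box], str], int],
) -> Pivot:
    """GPT-PS-style two-step selection: pivot frame first, then pivot box."""
    frame_index = pick_frame(frames, text)            # temporal reasoning step
    candidates = detect_boxes(frames[frame_index], text)
    if not candidates:
        raise ValueError("no candidate boxes on the pivot frame")
    box_index = pick_box(frames[frame_index], candidates, text)  # spatial step
    return Pivot(frame_index, candidates[box_index])


def segment_video(
    frames: Sequence[object],
    ref: Reference,
    *,
    audio_to_text: Callable[[AudioClip], str],
    pick_frame: Callable[[Sequence[object], str], int],
    detect_boxes: Callable[[object, str], List[Box]],
    pick_box: Callable[[object, List[Box], str], int],
    propagate: Callable[[Sequence[object], Pivot], List[object]],
) -> List[object]:
    """Run the same pipeline for audio (AVS) and language (RVOS) references."""
    text = unify_reference(ref, audio_to_text)
    pivot = select_pivot(frames, text, pick_frame, detect_boxes, pick_box)
    # A video segmenter such as SAM 2 would propagate the mask from the pivot.
    return propagate(frames, pivot)
```

Passing the models in as callables keeps the sketch independent of any particular API and makes the single shared path for both tasks explicit: only unify_reference distinguishes an audio reference from a language one.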
