Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation
Zikun Zhou, Wentao Xiong, Li Zhou, Xin Li, Zhenyu He, Yaowei Wang
TL;DR
This work addresses Referring Video Object Segmentation (RVOS) by transferring Vision-Language Pretrained (VLP) models to the task via temporal-aware adaptation, building on their aligned VL feature space for pixel-level segmentation in videos. It introduces temporal-aware prompt-tuning to adapt pretrained representations and inject temporal context, a Cube-Frame Attention mechanism for efficient spatial-temporal reasoning, and a three-stage VL relation modeling pipeline (reference-guided encoding, VLFF fusion, and STR with shallow features). The approach freezes the VLP backbone and relies on learnable prompts and carefully designed attention modules to learn task-specific VL relations from limited video data, achieving state-of-the-art or competitive results across five benchmarks with strong generalization and real-time performance. The findings indicate that leveraging aligned VL spaces and prompt-based adaptation substantially improves cross-modal understanding in dynamic scenes, advancing practical RVOS systems.
Abstract
The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual contents at the pixel level. Current RVOS methods typically use vision and language models pretrained independently as backbones. Because images and texts are mapped to uncoupled feature spaces, these methods face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pretrained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pretraining task (static image/region-level prediction) and the RVOS task (dynamic pixel-level prediction). To address this transfer challenge, we introduce a framework named VLP-RVOS, which harnesses VLP models for RVOS through temporal-aware adaptation. We first propose a temporal-aware prompt-tuning method, which not only adapts pretrained representations for pixel-level prediction but also empowers the vision encoder to model temporal contexts. We further customize a cube-frame attention mechanism for robust spatial-temporal reasoning. In addition, we propose to perform multi-stage VL relation modeling both during and after feature extraction for comprehensive VL understanding. Extensive experiments demonstrate that our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
