Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation

Zikun Zhou; Wentao Xiong; Li Zhou; Xin Li; Zhenyu He; Yaowei Wang

TL;DR

This work tackles Referring Video Object Segmentation (RVOS) by transferring Vision-Language Pretrained (VLP) models to the task through temporal-aware adaptation, so that their aligned VL feature space can be used for pixel-level segmentation in videos. It introduces temporal-aware prompt-tuning, which adapts the pretrained representations and injects temporal context; a cube-frame attention mechanism for spatial-temporal reasoning; and VL relation modeling in three stages (reference-guided visual encoding during feature extraction, VL feature fusion after feature extraction, and relation modeling with shallow features). The VLP backbone stays frozen, and the learnable prompts and attention modules learn task-specific VL relations from limited video data. The method is reported to achieve state-of-the-art or competitive results across five benchmarks, with strong generalization and real-time performance. The results indicate that building on an aligned VL feature space with prompt-based adaptation improves cross-modal understanding in dynamic scenes.

Abstract

The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual contents at the pixel level. Current RVOS methods typically use vision and language models pretrained independently as backbones. As images and texts are mapped to uncoupled feature spaces, they face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pretrained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pretraining task (static image/region-level prediction) and the RVOS task (dynamic pixel-level prediction). To address this transfer challenge, we introduce a framework named VLP-RVOS, which harnesses VLP models for RVOS through temporal-aware adaptation. We first propose a temporal-aware prompt-tuning method, which not only adapts pretrained representations for pixel-level prediction but also empowers the vision encoder to model temporal contexts. We further customize a cube-frame attention mechanism for robust spatial-temporal reasoning. Besides, we propose to perform multi-stage VL relation modeling during and after feature extraction for comprehensive VL understanding. Extensive experiments demonstrate that our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
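
The prompt-tuning idea in the abstract can be illustrated with a minimal sketch: learnable prompt tokens are appended to the input tokens of an encoder whose pretrained weights are never updated. Everything concrete here is an assumption made for illustration and is not taken from the paper: the single-head attention layer, the random stand-in weights, the helper names `frozen_self_attention` and `encode_with_prompts`, the choice to discard the prompt outputs, and all sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention whose weights are treated as fixed."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v

def encode_with_prompts(patch_tokens, prompts, weights):
    """Append prompt tokens, run the frozen layer, keep only patch outputs.

    Dropping the prompt outputs is an illustrative choice, not the paper's.
    """
    tokens = np.concatenate([patch_tokens, prompts], axis=0)
    out = frozen_self_attention(tokens, *weights)
    return out[: patch_tokens.shape[0]]

dim, num_patches, num_prompts = 8, 16, 4

# Stand-in for pretrained weights; in prompt-tuning these stay frozen.
weights = tuple(rng.standard_normal((dim, dim)) for _ in range(3))
patch_tokens = rng.standard_normal((num_patches, dim))

# The prompts are the only parameters a training loop would update.
prompts = np.zeros((num_prompts, dim))

features = encode_with_prompts(patch_tokens, prompts, weights)
learnable = prompts.size
frozen = sum(w.size for w in weights)
```

Even in this toy setting, changing only `prompts` changes the patch features, because every patch token attends to the prompt tokens; that is the lever prompt-tuning relies on while the backbone weights stay untouched.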

Paper Structure (24 sections, 3 equations, 13 figures, 7 tables)

Introduction
Related Work
VLP-RVOS
Vision-Language (VL) encoders
Temporal-aware VL prompt-tuning
Temporal-aware vision prompt-tuning
Language prompt-tuning
Multi-stage VL relation modeling
Reference-guided visual encoding during feature extraction
VL feature fusion after feature extraction
VL relation modeling with shallow features
Spatial-temporal reasoning for RVOS
Experiments
Experimental settings
Ablation studies
...and 9 more sections

Figures (13)

Figure 1: Two paradigms of learning dense text-video relation modeling for RVOS. Compared with learning from scratch, learning such a relation modeling ability based on the aligned VL feature space is more accessible and derives better performance.
Figure 2: Comparison between using ViT-B/16 CLIP w/o and w/ temporal modeling. When the blue jeans disappear from view in the $140^{th}$ frame, our method can still understand that the person is the referred target according to the temporal clue, while the variant without temporal modeling cannot.
Figure 3: Comparison with state-of-the-art algorithms on Ref-DAVIS17 URVOS. We visualize $\mathcal{J}\&\mathcal{F}$ w.r.t. the learnable params of different methods. Note that we freeze the VLP model. The circle size indicates the ratio of $\mathcal{J}\&\mathcal{F}$ to the learnable params.
Figure 4: Overall architecture of VLP-RVOS, which processes long videos clip-by-clip. The prompt tokens are first appended to the input VL tokens. Then the vision encoder extracts video features with the guidance of learnable vision/temporal prompts and historical prompts conditioned on the previous clip. The language encoder, tuned by learnable language prompts, extracts linguistic features. VL feature fusion and spatial-temporal reasoning modules associate linguistic concepts with corresponding dynamic visual contents. A segmentation head is used for final target segmentation. ①, ② and ③ mark the three VL relation modeling stages. MSA/MCA denotes multi-head self/cross-attention. LN is layer normalization. $\oplus$ is element-wise summation.
Figure 5: Illustration of our Parameter-Reusing Temporal Capture module. It reuses each transformer layer in the VLP model as the encoder and decoder to capture the temporal clue.
...and 8 more figures
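
The clip-by-clip processing described in the Figure 4 caption, where each clip is encoded with historical prompts conditioned on the previous clip, can be sketched as a simple loop that carries state between clips. This is only a data-flow illustration under stated assumptions: `split_into_clips`, `segment_clip`, and `process_video` are hypothetical names, frames are stand-in floats, and the body of `segment_clip` is a placeholder rather than the paper's model.

```python
def split_into_clips(frames, clip_len):
    """Split a frame sequence into consecutive clips of at most clip_len frames."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]

def segment_clip(clip, history):
    """Placeholder for the per-clip model (hypothetical, not the paper's).

    Returns one boolean "mask" per frame and a summary of this clip that
    serves as the history for the next clip.
    """
    clip_summary = sum(clip) / len(clip)
    # The first clip has no history; later clips blend in the previous summary.
    context = clip_summary if history is None else 0.5 * (history + clip_summary)
    masks = [frame >= context for frame in clip]
    return masks, clip_summary

def process_video(frames, clip_len):
    """Process a long video clip by clip, passing history forward."""
    history = None
    outputs = []
    for clip in split_into_clips(frames, clip_len):
        masks, history = segment_clip(clip, history)
        outputs.extend(masks)
    return outputs

# Toy usage: seven stand-in "frames" processed in clips of three.
video = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
results = process_video(video, clip_len=3)
```

The point of the loop is that memory use depends on the clip length rather than the video length, while the carried `history` gives each clip access to context from the clip before it.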
