3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation
Feiyu Pan, Hao Fang, Xiankai Lu
TL;DR
The paper tackles motion-expression guided referring video object segmentation (RVOS), where the domain gap between independently pre-trained video and language backbones hinders cross-modal grounding. It proposes a frozen convolutional CLIP backbone that produces aligned vision and text features, augmented by three cross-modal interaction modules and a novel video query initialization method. Key contributions include (i) preserving pre-trained VLM knowledge while reducing training cost, (ii) explicit cross-modal fusion through a cross-modal encoder, a frame-query decoder, and a video-query decoder, and (iii) a frame-to-video query aggregation strategy. On the MeViS dataset, the approach achieves 51.5 $\mathcal{J}$&$\mathcal{F}$ on the test set and ranks 3rd in the MeViS Track of the CVPR 2024 PVUW workshop.
Abstract
Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, which requires modeling dense text-video relations. Current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. Moreover, in cross-modal feature interaction, text features are used only for query initialization, so important information in the text is not fully exploited. In this work, we propose using a frozen pre-trained vision-language model (VLM) as the backbone, with a specific emphasis on enhancing cross-modal feature interaction. First, we use a frozen convolutional CLIP backbone to generate aligned vision and text features, alleviating the domain gap and reducing training costs. Second, we add more cross-modal feature fusion to the pipeline to make better use of multi-modal information. Third, we propose a novel video query initialization method to generate higher-quality video queries. Without bells and whistles, our method achieved 51.5 $\mathcal{J}$&$\mathcal{F}$ on the MeViS test set and ranked 3rd in the MeViS Track of the CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation.
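For readers unfamiliar with the reported metric, $\mathcal{J}$&$\mathcal{F}$ is conventionally the mean of region similarity $\mathcal{J}$ (mask intersection-over-union) and contour accuracy $\mathcal{F}$ (a boundary F-measure). The sketch below is only an illustration of that convention, not the official MeViS evaluation code: the function names `region_similarity` and `j_and_f` are hypothetical, the contour score is taken as a given number rather than computed, and scoring two empty masks as 1.0 is an assumption.

```python
import numpy as np


def region_similarity(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return 0.5 * (j + f)


# Toy 2x2 masks: one overlapping pixel out of two in the union.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
j_score = region_similarity(pred_mask, gt_mask)  # 1 / 2 = 0.5
f_score = 0.75  # placeholder contour score, not computed here
combined = j_and_f(j_score, f_score)  # (0.5 + 0.75) / 2 = 0.625
```

In practice both scores are computed per frame and averaged over all frames and expressions, and leaderboard values such as 51.5 are reported on a 0 to 100 scale.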
