3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation

Feiyu Pan, Hao Fang, Xiankai Lu

TL;DR

The paper tackles motion-expression guided RVOS, where the domain gap between video and language features hinders cross-modal grounding. It proposes a frozen convolutional CLIP backbone that produces aligned vision and text features, augmented by three cross-modal interaction modules and a novel video query initialization method. Key contributions include (i) preserving pre-trained VLM knowledge while reducing training cost, (ii) explicit cross-modal fusion through a cross-modal encoder, a frame-query decoder, and a video-query decoder, and (iii) a frame-to-video query aggregation strategy. On the MeViS test set, the approach achieves 51.5 $\mathcal{J}\&\mathcal{F}$ and ranks 3rd in the MeViS Track of the CVPR 2024 PVUW workshop.

Abstract

Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, emphasizing the modeling of dense text-video relations. Current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. In cross-modal feature interaction, text features are used only for query initialization, so important information in the text is not fully utilized. In this work, we propose using frozen pre-trained vision-language models (VLMs) as backbones, with a specific emphasis on enhancing cross-modal feature interaction. Firstly, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the domain gap and reducing training costs. Secondly, we add more cross-modal feature fusion to the pipeline to enhance the utilization of multi-modal information. Furthermore, we propose a novel video query initialization method to generate higher-quality video queries. Without bells and whistles, our method achieved 51.5 J&F on the MeViS test set and ranked 3rd in the MeViS Track of the CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation.
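For readers unfamiliar with the reported metric: J&F is the mean of region similarity J (the Jaccard index, i.e. intersection over union of predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below is a minimal illustration and not the official evaluation code. It computes J on a toy pair of masks and averages it with an F value that is simply supplied as a number, since the boundary F-measure is not implemented here. The function names and the empty-mask convention are assumptions made for illustration.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores, f_scores):
    """Average the mean J and the mean F into a single J&F score."""
    j_mean = sum(j_scores) / len(j_scores)
    f_mean = sum(f_scores) / len(f_scores)
    return (j_mean + f_mean) / 2


# Toy masks: 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)

# The F value here is a made-up placeholder, not a result from the paper.
score = j_and_f([j], [0.7])
```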

Paper Structure (13 sections, 2 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Comparisons of LMPM and our Framework.
  • Figure 2: Overview of the proposed architecture. The model takes multiple frames from a video clip and a motion expression as input, and a frozen CLIP backbone outputs multi-scale visual features and sentence-level text features. The Cross-Modal Encoder fuses text and image features, and the Frame Query Decoder independently generates frame queries for each frame. The Video Query Initializer then reorders and fuses all frame queries to adaptively initialize the video queries. Finally, the Video Query Decoder refines the video queries for the final mask prediction.
  • Figure 3: Illustration of Cross-modal Encoder, Frame Query Decoder, Video Query Initializer and Video Query Decoder.
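The captions above say that frame queries are reordered and fused to initialize the video queries, but they do not say how. The sketch below is a hypothetical stand-in meant only to make that data flow concrete. It assumes greedy cosine-similarity matching against the first frame for the reordering step and a plain average for the fusion step; the paper's actual initializer may differ and is likely a learned module. All function names are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reorder_to_reference(reference, frame_queries):
    """Greedily permute frame_queries so slot i holds the best match for reference[i].

    Assumption: greedy matching is used here for brevity; it is not
    claimed to be the matching scheme used in the paper.
    """
    remaining = list(range(len(frame_queries)))
    ordered = []
    for ref in reference:
        best = max(remaining, key=lambda j: cosine(ref, frame_queries[j]))
        ordered.append(frame_queries[best])
        remaining.remove(best)
    return ordered


def init_video_queries(all_frame_queries):
    """Map T frames x N queries x C dims to N x C initial video queries."""
    reference = all_frame_queries[0]
    aligned = [reference] + [
        reorder_to_reference(reference, fq) for fq in all_frame_queries[1:]
    ]
    num_frames = len(aligned)
    num_queries = len(reference)
    dim = len(reference[0])
    # Assumption: fusion is a simple mean over frames for each aligned slot.
    return [
        [sum(aligned[t][n][c] for t in range(num_frames)) / num_frames
         for c in range(dim)]
        for n in range(num_queries)
    ]


# Two frames, two queries each; the second frame lists its queries in swapped order.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
video_queries = init_video_queries(frames)
```

In this toy input the second frame's queries are swapped relative to the first, so the reordering step realigns them before averaging. Without that step, each fused query would mix two different objects.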