The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation

Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang

TL;DR

This work tackles motion-expression guided video segmentation on the MeViS dataset using Sa2VA-based large multimodal models (LMMs). The authors propose a simple inference-time optimization with three parts: (1) use Sa2VA as a unified baseline, (2) uniformly sample frames so the LMM sees context from the whole video, and (3) ensemble multiple expert models to mitigate single-model errors. The method achieves $61.98\%$ $\mathcal{J} \& \mathcal{F}$ on the MeViS test set and ranks 1st in the MeViS Track of the 4th PVUW Challenge at CVPR 2025. The result suggests that careful inference design, without changes to the underlying model, can extract more from existing LMMs for dense, pixel-level video understanding in motion-centric scenarios.

Abstract

Motion expression video segmentation aims to segment objects according to input motion expressions. In contrast to conventional Referring Video Object Segmentation (RVOS), it emphasizes motion and multi-object expressions, which makes it more challenging. Recently, Large Multimodal Models (LMMs) have begun to shine in RVOS thanks to their powerful vision-language perception capabilities. In this work, we propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. First, we use Sa2VA, a unified LMM for dense grounded understanding of both images and videos, as our baseline. Second, we uniformly sample video frames during inference to enhance the model's understanding of the entire video. Finally, we integrate the results of multiple expert models to mitigate the erroneous predictions of a single model. Our solution achieved $61.98\%$ $\mathcal{J} \& \mathcal{F}$ on the MeViS test set and ranked 1st in the 4th PVUW Challenge MeViS Track at CVPR 2025.
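
The two inference-time steps named in the abstract, uniform frame sampling and multi-model fusion, can be illustrated with a minimal sketch. The abstract does not specify how many frames are sampled or which fusion rule is used, so everything here is an assumption for illustration: the helper names `uniform_sample_indices` and `majority_vote` are hypothetical, frames are taken at the centre of equal-width segments, and masks are fused by a pixel-wise strict majority vote.

```python
import numpy as np


def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over the whole video.

    Assumption: each index is the centre of one of `num_samples`
    equal-width segments; the source does not state the exact scheme.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Short video: use every frame.
        return list(range(num_frames))
    # Integer arithmetic for the centre of segment i.
    return [(2 * i + 1) * num_frames // (2 * num_samples)
            for i in range(num_samples)]


def majority_vote(masks: list[np.ndarray]) -> np.ndarray:
    """Fuse binary masks of identical shape (T, H, W) from several models.

    Assumption: a pixel is foreground when more than half of the models
    mark it as foreground. This is one possible fusion rule, not
    necessarily the one used by the authors.
    """
    stacked = np.stack([m.astype(bool) for m in masks], axis=0)
    votes = stacked.sum(axis=0)
    return votes * 2 > len(masks)


if __name__ == "__main__":
    print(uniform_sample_indices(10, 5))
    a = np.array([[[1, 1], [0, 0]]])
    b = np.array([[[1, 0], [0, 0]]])
    c = np.array([[[1, 1], [1, 0]]])
    print(majority_vote([a, b, c]).astype(int))
```

In this sketch the sampled indices would select the frames passed to the LMM, and the fused mask would replace any single model's prediction. A strict majority keeps a pixel only when most models agree, which is one simple way to suppress errors made by a single model.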

Paper Structure

This paper contains 11 sections, 2 equations, 1 figure, 3 tables, 1 algorithm.

Figures (1)

  • Figure 1: The architecture of Sa2VA [yuan2025sa2va]. The model first encodes the input texts, visual prompts, images, and videos into token embeddings. These tokens are then processed through a large language model (LLM). The output text tokens are used to generate the "[SEG]" token and associated language outputs. The SAM 2 decoder receives the image and video features from the SAM 2 encoder, along with the "[SEG]" token, to generate corresponding image and video masks.