4th PVUW MeViS 3rd Place Report: Sa2VA

Haobo Yuan, Tao Zhang, Xiangtai Li, Lu Qi, Zilong Huang, Shilin Xu, Jiashi Feng, Ming-Hsuan Yang

TL;DR

This work tackles RVOS on the motion-focused MeViS dataset by leveraging a grounded MLLM framework (Sa2VA) that combines InternVL2.5 with SAM-2. A training-free test-time augmentation, Long-Interleaved Inference (LII), samples key frames across an extended temporal window to enrich motion-context embeddings. The approach achieves 56.3 J&F and third place among 32 teams in the 4th PVUW MeViS challenge, underscoring the value of inference-time design and long-range temporal context. The findings suggest that grounded MLLMs with smart test-time strategies can yield substantial gains in motion-expression RVOS without additional dataset-specific training, guiding future research in RVOS and video-grounded understanding.

Abstract

Referring video object segmentation (RVOS) is a challenging task that requires a model to segment an object in a video given a language description. MeViS is a recently proposed dataset whose expressions describe the motion of the target objects, which makes it a more challenging benchmark than existing RVOS benchmarks. Meanwhile, a new trend in referring expression tasks is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that a simple modification to the test-time inference method of a stronger MLLM leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.
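As a rough illustration of what "enlarging the scope of key frames" could mean in practice, the sketch below contrasts two ways of choosing key frame indices. It is not the report's Long-Interleaved Inference procedure, whose details are not given in this summary. Both function names, the "first k frames" baseline, and the evenly spaced sampling rule are assumptions made for the example.

```python
from typing import List


def first_k_key_frames(num_frames: int, k: int) -> List[int]:
    """Assumed baseline: use only the first k frames as key frames."""
    return list(range(min(k, max(num_frames, 0))))


def interleaved_key_frames(num_frames: int, k: int) -> List[int]:
    """Hypothetical wider scheme: spread k key frames over the whole video.

    The video is split into k equal-length segments and the centre frame
    of each segment is selected, so the key frames cover early, middle,
    and late motion instead of only the opening frames.
    """
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)  # cannot pick more key frames than frames
    return [(2 * i + 1) * num_frames // (2 * k) for i in range(k)]


if __name__ == "__main__":
    print(first_k_key_frames(100, 5))
    print(interleaved_key_frames(100, 5))
```

For a 100-frame clip with five key frames, the assumed baseline only sees frames 0 to 4, while the spread-out variant touches frames across the full clip. That wider coverage is the kind of long-range temporal context the report credits for its gains on motion expressions.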

Paper Structure

This paper contains 8 sections, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: The Sa2VA model. The model first encodes the input texts, visual prompts, images, and videos into token embeddings. These tokens are then processed by a large language model (LLM). The output text tokens are used to generate the "[SEG]" token and the associated language outputs. The SAM-2 decoder receives the image and video features from the SAM-2 encoder, along with the "[SEG]" token, to generate the corresponding image and video masks. Modules marked with a fire icon are trained during the one-shot instruction tuning. Note that we do not train the model on the MeViS dataset; we only use the pre-trained Sa2VA model (Yuan et al., 2025) for inference.
  • Figure 2: Visualization comparison. Sa2VA with the Long-Interleaved Inference (LII) pipeline (w/ LII) shows a better understanding of motion information in longer videos than Sa2VA without it (w/o LII).