4th PVUW MeViS 3rd Place Report: Sa2VA
Haobo Yuan, Tao Zhang, Xiangtai Li, Lu Qi, Zilong Huang, Shilin Xu, Jiashi Feng, Ming-Hsuan Yang
TL;DR
This work tackles referring video object segmentation (RVOS) on the motion-focused MeViS dataset using Sa2VA, a grounded MLLM framework that combines InternVL2.5 with SAM-2. A training-free test-time augmentation, Long-Interleaved Inference (LII), samples key frames across an extended temporal window to enrich motion-context embeddings. The approach achieves 56.3 J&F and third place among 32 teams in the 4th PVUW MeViS challenge, underscoring the value of inference-time design and long-range temporal context. The findings suggest that grounded MLLMs paired with well-chosen test-time strategies can yield substantial gains on motion-expression RVOS without additional dataset-specific training, which may guide future research in RVOS and video-grounded understanding.
Abstract
Referring video object segmentation (RVOS) is a challenging task that requires a model to segment an object in a video given a language description. MeViS is a recently proposed dataset whose expressions describe the motion of the target objects, which makes it a more challenging benchmark than existing RVOS benchmarks. Meanwhile, a new trend in referring expression tasks is to adopt multi-modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.
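To make the idea of "enlarging the scope of key frames" concrete, the following is a minimal sketch and not the report's actual implementation. It contrasts taking the first few frames of a clip as key frames with spreading the same number of key frames evenly across the whole clip. The function names, the even-spacing rule, and the choice of five key frames are illustrative assumptions; the text above does not specify the exact sampling schedule used by Long-Interleaved Inference.

```python
# Hypothetical sketch: the function names and the even-spacing rule are
# assumptions for illustration, not the sampling schedule from the report.

def contiguous_key_frames(num_frames: int, num_key: int) -> list[int]:
    """Baseline: use the first `num_key` frames of the clip as key frames."""
    if num_frames <= 0 or num_key <= 0:
        return []
    return list(range(min(num_key, num_frames)))


def interleaved_key_frames(num_frames: int, num_key: int) -> list[int]:
    """Spread `num_key` key frames evenly over the whole clip.

    The first and last frames are always included when num_key >= 2, so the
    selected frames cover the full temporal extent of the video.
    """
    if num_frames <= 0 or num_key <= 0:
        return []
    if num_key >= num_frames:
        # Not enough frames to subsample: use every frame.
        return list(range(num_frames))
    if num_key == 1:
        return [0]
    # Integer arithmetic keeps the indices deterministic and strictly increasing.
    return [(i * (num_frames - 1)) // (num_key - 1) for i in range(num_key)]


if __name__ == "__main__":
    print(contiguous_key_frames(100, 5))   # frames from the start of the clip only
    print(interleaved_key_frames(100, 5))  # frames spanning the whole clip
```

In this sketch the number of key frames passed to the model is unchanged; only their temporal coverage grows, which is why such a change needs no retraining.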
