
SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track

Dengxian Gong, Quanzhu Niu, Shihao Chen, Yuanzheng Wu, Yikang Zhou, Tao Zhang, Haobo Yuan, Lu Qi, Shunping Ji

Abstract

Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion-centric expressions (referring & reasoning motion expressions) and by introducing no-target queries. Building on SaSaSa2VA, whose increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence-aware verification mechanism, yielding Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS-Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence-aware verification strategy is sufficient to unlock strong performance on motion-centric referring tasks.

Paper Structure

This paper contains 10 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The architecture of SaSaSa2VA [niu20251st]. Given a video of $T$ frames, the sequence is first split into $N$ temporally ordered clips, each consisting of $c = g^2{+}1$ frames. To improve efficiency while preserving temporal context, frames within each clip are compacted via the Key Frame Compression (KFC) strategy before being fed into the MLLM. From the compressed visual inputs, the MLLM generates a set of $N$ [SEG] tokens, where each token encodes segmentation cues for a specific temporal segment. For each clip, SAM2 takes the corresponding [SEG] token as a prompt, together with the original (uncompressed) frames, to decode object masks at the frame level. In this illustration, we set $c=5$, corresponding to $g=2$; a sketch of this clip layout is given after the figure list.
  • Figure 2: Illustration of the existence-aware verification in our method. Given a video-expression pair $(\mathcal{V}, \mathcal{T})$, Gemini 3-Flash-Preview and GPT-5.4 function as a dual-consensus jury: an expression is categorized as 'null-target' only under unanimous consensus, i.e., when both models independently confirm the object's absence. Only valid video-text pairs proceed to the SaSaSa2VA base model for inference; a sketch of this dual-consensus check is also given after the figure list.
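
The clip layout described in Figure 1 is concrete enough to sketch. The snippet below is a minimal, self-contained illustration, assuming only what the caption states: $T$ frames are split into $N$ temporally ordered clips of $c = g^2{+}1$ frames, with one [SEG] token produced per clip. The MLLM and SAM2 calls are reduced to comments; nothing here is the released SaSaSa2VA code, and `split_into_clips` is a hypothetical helper name.

```python
# Minimal sketch of the clip layout from Figure 1: T frames -> N ordered clips of
# c = g**2 + 1 frames, with one [SEG] token per clip. Not the authors' code.

def split_into_clips(num_frames: int, g: int = 2):
    """Return the per-clip frame indices for a video of `num_frames` frames."""
    c = g * g + 1                                    # clip length; c = 5 when g = 2
    return [list(range(i, min(i + c, num_frames)))   # the last clip may be shorter
            for i in range(0, num_frames, c)]

if __name__ == "__main__":
    clips = split_into_clips(num_frames=32, g=2)
    print(f"N = {len(clips)} clips")                 # one [SEG] token per clip
    for k, clip in enumerate(clips):
        # For each clip: KFC compresses its frames before the MLLM, the MLLM emits
        # one [SEG] token, and SAM2 decodes masks for the clip's original frames.
        print(f"clip {k}: frames {clip[0]}..{clip[-1]}")
```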
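
The dual-consensus rule in Figure 2 can likewise be summarized in a few lines. In the hedged sketch below, `judge_a` and `judge_b` are hypothetical callables standing in for the two LLM judges and `segment` stands in for SaSaSa2VA inference; the one assumption beyond the caption is that a unanimous "absent" verdict yields an empty (no-target) prediction, in line with the no-target queries mentioned in the abstract.

```python
# Hedged sketch of the existence-aware verification (Figure 2). An expression is
# treated as null-target only if BOTH judges independently report that the
# referred object is absent; otherwise the pair is forwarded to the base model.
# The judge and segmentation callables are assumptions, not the authors' API.

from typing import Callable, List


def is_null_target(video, expression: str,
                   judge_a: Callable[[object, str], bool],
                   judge_b: Callable[[object, str], bool]) -> bool:
    """Each judge returns True if it believes the referred object is absent."""
    return judge_a(video, expression) and judge_b(video, expression)


def run_pair(video, expression: str, judge_a, judge_b,
             segment: Callable[[object, str], List]) -> List:
    if is_null_target(video, expression, judge_a, judge_b):
        return []                       # unanimous "absent": predict no masks (assumed)
    return segment(video, expression)   # valid pair: run SaSaSa2VA inference
```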