The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji
TL;DR
This work addresses referring video object segmentation (RVOS), in which the objects described by a natural-language expression are segmented and tracked through a video, using a multi-modal large language model (MLLM) coupled with SAM2. It identifies two practical bottlenecks in the Sa2VA baseline, sparse frame sampling and reliance on a single [SEG] token for the whole video, and introduces Segmentation Augmentation (Key Frame Compression and multiple [SEG] tokens) plus test-time Selective Averaging. The resulting SaSaSa2VA achieves a J&F of 67.45 and ranks first in the 7th LSVOS Challenge RVOS track, 2.80 points ahead of the runner-up, with ablations supporting both the augmentation and the ensembling. The results suggest that wider temporal coverage and fusion of diverse predictions improve grounded MLLMs for language-conditioned video segmentation.
Abstract
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $\mathcal{J}\&\mathcal{F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and the ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/bytedance/Sa2VA.
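
To make the idea of test-time ensembling concrete, the following is a minimal sketch of pixel-wise mask averaging over a chosen subset of predictions. It is an illustration under stated assumptions, not the procedure used in SaSaSa2VA: the function name `average_masks`, the `selected` index list, and the 0.5 threshold are all hypothetical, and the rule by which the method selects which predictions to average is not described in this section.

```python
from typing import List, Sequence

# One frame of per-pixel foreground probabilities, as nested lists (rows of pixels).
Mask = List[List[float]]


def average_masks(
    prob_maps: Sequence[Mask],
    selected: Sequence[int],
    threshold: float = 0.5,  # hypothetical binarization threshold
) -> List[List[int]]:
    """Average the selected probability maps pixel-wise, then binarize."""
    if not selected:
        raise ValueError("select at least one prediction")
    height, width = len(prob_maps[0]), len(prob_maps[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            mean = sum(prob_maps[i][y][x] for i in selected) / len(selected)
            row.append(1 if mean > threshold else 0)
        fused.append(row)
    return fused


# Toy 1x3 "frame" as predicted by three hypothetical models.
preds = [
    [[0.9, 0.2, 0.6]],
    [[0.8, 0.1, 0.3]],
    [[0.7, 0.4, 0.2]],
]
fused = average_masks(preds, selected=[0, 1])
```

In this toy case the third pixel is foreground for the first model alone but falls below the threshold once the second model's prediction is averaged in, which is the kind of disagreement that averaging is meant to smooth out.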
