The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA

Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji

TL;DR

This work addresses the challenge of referring video object segmentation (RVOS) by grounding segmentation in natural-language expressions using a multi-modal large language model (MLLM) coupled with SAM2. It identifies two practical bottlenecks in the Sa2VA baseline—sparse frame sampling and a single [SEG] token for whole-video prompts—and introduces Segmentation Augmentation (Key Frame Compression and multi-[SEG] tokens) plus test-time Selective Averaging. The proposed SaSaSa2VA demonstrates strong empirical gains, achieving a J&F of 67.45 and winning first place at the 7th LSVOS RVOS track, with ablations confirming the effectiveness of augmentation and ensembling. The results underscore the value of augmenting temporal coverage and fusing diverse predictions to improve grounded MLLMs for video segmentation in real-world, language-conditioned settings.

Abstract

Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $\mathcal{J\&F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and the ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/bytedance/Sa2VA.

Paper Structure

This paper contains 10 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of Segmentation Augmentation in SaSaSa2VA. We design a Key Frame Compression (KFC) strategy and a Scaling [SEG] tokens strategy on top of Sa2VA [yuan2025sa2va]. A $T$-frame video is divided into $N$ non-overlapping clips, each containing $c = g^2 + 1$ frames. The frames passed to the MLLM are then compressed via KFC. The MLLM outputs $N$ [SEG] tokens, each corresponding to one clip. For a given clip, conditioned on the original $c$ frames and the hidden state of its [SEG] token, SAM2 decodes the masks for that clip. In this figure, $c$ is set to $5$, resulting in $g = 2$.
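
The clip partition described in the Figure 1 caption can be illustrated with a minimal Python sketch. It only covers the index arithmetic: recovering $g$ from $c = g^2 + 1$ and splitting $T$ frame indices into non-overlapping clips, one per [SEG] token. The function names are hypothetical and are not taken from the Sa2VA repository, and the handling of a final clip shorter than $c$ frames is an assumption, since the caption does not say what happens when $T$ is not a multiple of $c$.

```python
import math


def grid_size_from_clip_length(c: int) -> int:
    """Solve c = g**2 + 1 for a positive integer g (hypothetical helper)."""
    if c < 2:
        raise ValueError("clip length must be at least 2")
    g = math.isqrt(c - 1)
    if g * g + 1 != c:
        raise ValueError(f"clip length {c} is not of the form g**2 + 1")
    return g


def split_into_clips(num_frames: int, g: int) -> list[range]:
    """Split frame indices 0..num_frames-1 into non-overlapping clips.

    Each clip holds c = g**2 + 1 consecutive frames. Assumption (not stated
    in the caption): leftover frames at the end form one shorter final clip.
    """
    c = g * g + 1
    return [range(start, min(start + c, num_frames))
            for start in range(0, num_frames, c)]


# Setting from the figure: c = 5 frames per clip, so g = 2.
g = grid_size_from_clip_length(5)
clips = split_into_clips(num_frames=10, g=g)
num_seg_tokens = len(clips)  # one [SEG] token per clip, i.e. N
```

With a 10-frame video and $c = 5$, this gives $N = 2$ clips covering frames 0 to 4 and 5 to 9. The compression that KFC applies to the frames sent to the MLLM is deliberately left out, since the caption does not specify it.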