Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference

Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe

TL;DR

The paper identifies a critical mismatch between training and inference in Sa2VA that limits performance on referring video segmentation. It introduces Sa2VA-i, a consistency-focused variant that predicts initial per-frame masks without memory conditioning during both training and inference and uses the frozen SAM2 for propagation to full videos, aided by uniform frame sampling. This approach achieves state-of-the-art results across MeViS, Ref-YT-VOS, Ref-DAVIS, and ReVOS benchmarks, with improvements such as up to $+11.6$ in $\text{J}\&\text{F}$ on MeViS while maintaining a modest memory footprint (~16MB). Comprehensive ablations demonstrate the impact of consistency and sampling strategies, and the authors release updated models and code for reproducibility.

Abstract

Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i

Paper Structure

This paper contains 10 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Sa2VA-i Architecture Overview. Sa2VA-i first predicts initial masks $\mathcal{M}_T$ using a finetuned SAM2 mask decoder by taking the generated [SEG] token and predicting a mask for all $T$ sampled frames separately. Next, it propagates these $\mathcal{M}_T$ masks across all video frames $I$ using SAM2's original mask decoder, yielding output masks $\mathcal{M}_I$.
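
The two-stage flow in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not taken from the released code: `uniform_frame_indices`, `segment_video`, and the two callables `predict_initial_mask` and `propagate` are hypothetical names, and the bin-centre sampling rule is only one plausible reading of "uniform frame sampling". The sketch assumes that each of the $T$ sampled frames gets its own initial mask from the [SEG] token, with no memory conditioning, and that a separate propagation step then produces masks for all frames.

```python
from typing import Any, Callable, Dict, List, Sequence


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick evenly spaced frame indices (assumed rule: centre of each equal-width bin)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        # Short video: use every frame.
        return list(range(num_frames))
    return [int((i + 0.5) * num_frames / num_samples) for i in range(num_samples)]


def segment_video(
    frames: Sequence[Any],
    seg_token: Any,
    predict_initial_mask: Callable[[Any, Any], Any],  # hypothetical: finetuned mask decoder
    propagate: Callable[[Sequence[Any], Dict[int, Any]], List[Any]],  # hypothetical: frozen SAM2
    num_samples: int = 5,
) -> List[Any]:
    """Stage 1: independent per-frame masks M_T. Stage 2: propagation to all frames M_I."""
    sampled = uniform_frame_indices(len(frames), num_samples)
    # Each sampled frame is decoded separately; no frame sees another frame's mask.
    initial_masks = {t: predict_initial_mask(frames[t], seg_token) for t in sampled}
    return propagate(frames, initial_masks)
```

With stub callables, `segment_video` returns one mask per video frame, and the sampled frames carry the masks predicted in the first stage. In a real system, `propagate` would wrap a video object segmentation model seeded with the initial masks.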