Referring Video Object Segmentation via Language-aligned Track Selection

Seongchan Kim; Woojeong Jin; Sangbeom Lim; Heeji Yoon; Hyunwook Choi; Seungryong Kim

TL;DR

This work tackles referring video object segmentation (RVOS) by using SAM2 object tokens as compact video-level representations and a lightweight language-aligned track selection module to bridge the vision-language gap. It introduces an IoU-based pseudo-labeling strategy to supervise the alignment while keeping SAM2 frozen, and trains a small set of parameters on a single GPU. The method achieves state-of-the-art results on the MeViS dataset (e.g., $J$ and $F$ scores of 48.6 with 32.9M trainable parameters) and generalizes well in zero-shot and cross-dataset settings, including Ref-YouTube-VOS and Ref-DAVIS. The approach offers efficient, robust RVOS with improved multi-modal alignment and motion understanding, broadening its practical applicability to interactive video tasks.

Abstract

Referring video object segmentation (RVOS) requires tracking and segmenting an object throughout a video according to a given natural language expression, demanding both complex motion understanding and the alignment of visual representations with language descriptions. Given these challenges, the recently proposed Segment Anything Model 2 (SAM2) emerges as a potential candidate due to its ability to generate coherent segmentation mask tracks across video frames and to provide inherent spatio-temporal objectness in its object token representations. In this paper, we introduce SOLA (Selection by Object Language Alignment), a novel framework that leverages SAM2 object tokens as compact video-level object representations, which are aligned with language features through a lightweight track selection module. To effectively facilitate this alignment, we propose an IoU-based pseudo-labeling strategy, which bridges the modality gap between SAM2 representations and language features. Extensive experiments show that SOLA achieves state-of-the-art performance on the MeViS dataset and demonstrate that SOLA offers an effective solution for RVOS. Our project page is available at: https://cvlab-kaist.github.io/SOLA.
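To make the idea of IoU-based pseudo-labeling concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each candidate mask track is compared with the ground-truth mask track using a video-level IoU (intersection and union accumulated over all frames), and that a track is marked as a positive when its IoU exceeds a threshold. The function names `track_iou` and `pseudo_labels`, the video-level aggregation, and the threshold value of 0.5 are all illustrative assumptions; the abstract does not specify them.

```python
from typing import List, Sequence

Mask = Sequence[Sequence[int]]   # one frame, 0/1 per pixel
Track = Sequence[Mask]           # one binary mask per frame


def track_iou(pred: Track, gt: Track) -> float:
    """Video-level IoU: intersection and union are summed over all frames."""
    inter = 0
    union = 0
    for pred_frame, gt_frame in zip(pred, gt):
        for pred_row, gt_row in zip(pred_frame, gt_frame):
            for p, g in zip(pred_row, gt_row):
                inter += 1 if (p and g) else 0
                union += 1 if (p or g) else 0
    # Two empty tracks have no union; treat their IoU as 0.
    return inter / union if union else 0.0


def pseudo_labels(tracks: List[Track], gt: Track, threshold: float = 0.5) -> List[int]:
    """Return 1 for tracks whose IoU with the ground truth exceeds the
    threshold (an assumed value), and 0 otherwise."""
    return [1 if track_iou(t, gt) > threshold else 0 for t in tracks]


# Toy example: two frames of 2x2 pixels.
gt = [[[1, 1], [0, 0]], [[1, 1], [0, 0]]]
track_a = [[[1, 1], [0, 0]], [[1, 1], [0, 0]]]   # identical to the ground truth
track_b = [[[0, 0], [1, 1]], [[0, 0], [1, 1]]]   # no overlap
track_c = [[[1, 0], [0, 0]], [[1, 1], [0, 0]]]   # misses one pixel
print(pseudo_labels([track_a, track_b, track_c], gt))  # [1, 0, 1]
```

In a setup like the one described, such binary labels could serve as training targets for a track selection module that scores each candidate track against the language expression, while the segmentation model that produced the tracks stays frozen.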
