The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution

Bin Cao; Yisi Zhang; Hanyi Wang; Xingjian He; Jing Liu

TL;DR

This work addresses referring video object segmentation (RVOS), the task of segmenting objects in a video given a natural language expression, on the challenging MeViS dataset. It introduces an instance-centric transformer framework that combines a DETR-based MUTR model with explicit instance masks, a DVIS-driven instance retrieval module for long-sequence referring, and an HQ-SAM refinement stage. A fusion strategy integrates frame-level and instance-level predictions to improve temporal consistency and spatial accuracy, achieving 52.67 $\text{J}\&\text{F}$ on the validation set and 60.36 on the test set. The results indicate that leveraging instance trajectories, long-sequence retrieval, and SAM-based refinement can enhance RVOS performance; the method placed 3rd in the RVOS track. The method highlights the value of cross-modal querying, instance-centric query initialization, and multi-level fusion for robust language-conditioned video segmentation with complex expressions. Overall, the approach shows potential for scalable, temporally coherent RVOS systems in real-world, motion-aware language understanding tasks.

Abstract

Referring Video Object Segmentation is an emerging multi-modal task that aims to segment objects in a video given a natural language expression. In this work, we build two instance-centric models and fuse their frame-level and instance-level predictions. First, we introduce instance masks into the DETR-based model for query initialization to achieve temporal enhancement, and employ SAM for spatial refinement. Second, we build an instance retrieval model that performs binary classification on each instance mask to decide whether the instance is the one referred to. Finally, we fuse the predicted results. Our method achieves 52.67 J&F in the validation phase and 60.36 J&F in the test phase, ranking 3rd in the 6th LSVOS Challenge RVOS Track.
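
For readers unfamiliar with the reported metric: in video object segmentation benchmarks, J&F is conventionally the mean of region similarity J (the Jaccard index, or IoU, between predicted and ground-truth masks) and boundary accuracy F. The following is a minimal sketch of that convention, not the challenge's official evaluation code. The function names `region_similarity` and `j_and_f` are hypothetical, the empty-mask convention is an assumption, and the boundary score F is taken as a given input rather than computed.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index J (intersection over union) of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score: float, f_score: float) -> float:
    """J&F is the mean of region similarity J and boundary accuracy F."""
    return (j_score + f_score) / 2.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
```

In practice, J and F are computed per frame and per object and then averaged over the dataset; the sketch covers only a single mask pair.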

Paper Structure (12 sections, 5 equations, 1 figure, 2 tables)


Introduction
Method
Overview
MUTR-based Model
HQ-SAM for Spatial Refinement
Instance Retrieval Model
Fusion Strategy
Experiments
Dataset and Metrics
Implementation Details
Ablation Experiments
Competition Results

Figures (1)

Figure 1: The architecture of the MUTR-based model. We employ MUTR as our base model (left). We introduce instance masks and employ an attention block and a sequential mechanism to aggregate instance information into a query (right).
