The Instance-centric Transformer for the RVOS Track of LSVOS Challenge: 3rd Place Solution
Bin Cao, Yisi Zhang, Hanyi Wang, Xingjian He, Jing Liu
TL;DR
This work addresses referring video object segmentation (RVOS), i.e. language-guided video object segmentation, on the challenging MeViS dataset. It introduces an instance-centric transformer framework that combines a DETR-based MUTR model initialized with explicit instance masks, a DVIS-driven instance retrieval module for long-sequence referring, and an HQ-SAM refinement stage. A fusion strategy integrates frame-level and instance-level predictions to improve temporal consistency and spatial accuracy, achieving 52.67 $\text{J}\&\text{F}$ on the validation set and 60.36 on the test set. The results show that leveraging instance trajectories, long-sequence retrieval, and SAM-based refinement can significantly enhance RVOS performance, earning 3rd place in the RVOS track. The method highlights the value of cross-modal querying, instance-centric query initialization, and multi-level fusion for robust language-conditioned video segmentation under complex expressions. Overall, the approach demonstrates strong potential for scalable, temporally coherent RVOS systems in real-world, motion-aware language understanding tasks.
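For readers unfamiliar with the reported metric, the following is a minimal sketch of how a J&F score is commonly formed: J is the region similarity (Jaccard index, or IoU) between a predicted and a ground-truth mask, F is a boundary F-measure, and J&F is their mean. The function names `region_similarity` and `j_and_f` are hypothetical, the boundary term F is passed in as a number rather than computed, and returning 1.0 for two empty masks is a convention assumed here. This is not the challenge's official evaluation code.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j, f):
    """Mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2.0


# Toy 2x3 masks: 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
j = region_similarity(pred, gt)
score = j_and_f(j, 0.7)  # 0.7 is a made-up boundary score for illustration
```

In practice the per-frame values are averaged over all frames and expressions, and the result is reported as a percentage.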
Abstract
Referring Video Object Segmentation (RVOS) is an emerging multi-modal task that aims to segment objects in a video given a natural language expression. In this work, we build two instance-centric models and fuse their frame-level and instance-level predictions. First, we introduce instance masks into a DETR-based model for query initialization to achieve temporal enhancement, and employ SAM for spatial refinement. Second, we build an instance retrieval model that performs binary classification on each instance mask to decide whether the instance is referred to. Finally, we fuse the predicted results. Our method achieves 52.67 J&F in the validation phase and 60.36 J&F in the test phase, securing 3rd place in the 6th LSVOS Challenge RVOS Track.
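The instance retrieval step can be illustrated with a small sketch. It assumes that each candidate instance comes with a per-frame binary mask trajectory and a single "referred" probability from the binary classifier, and that the instance-level prediction is the per-frame union of all instances scoring at or above a threshold. The function name `retrieve_referred_mask`, the 0.5 threshold, and the union rule are assumptions made for illustration; the abstract does not specify these details.

```python
def retrieve_referred_mask(instance_masks, referred_scores, threshold=0.5):
    """Union the mask trajectories of instances classified as referred.

    instance_masks: list over instances; each entry is a list over frames
        of 2D binary masks (nested lists of 0/1), all of the same shape.
    referred_scores: one probability per instance (hypothetical classifier output).
    """
    num_frames = len(instance_masks[0])
    height = len(instance_masks[0][0])
    width = len(instance_masks[0][0][0])
    output = [[[0] * width for _ in range(height)] for _ in range(num_frames)]

    for masks, score in zip(instance_masks, referred_scores):
        if score < threshold:
            continue  # instance judged as not referred by the expression
        for t in range(num_frames):
            for y in range(height):
                for x in range(width):
                    if masks[t][y][x]:
                        output[t][y][x] = 1
    return output


# Three toy instances over two frames of 1x2 masks.
inst_a = [[[1, 0]], [[1, 0]]]
inst_b = [[[0, 1]], [[0, 1]]]
inst_c = [[[0, 0]], [[0, 1]]]
result = retrieve_referred_mask([inst_a, inst_b, inst_c], [0.9, 0.2, 0.6])
```

Because the decision is made once per instance trajectory rather than per frame, the selected masks stay consistent across the whole sequence, which is the motivation for handling long videos at the instance level.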
