1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation
Zhuoyan Luo, Yicheng Xiao, Yong Liu, Yitong Wang, Yansong Tang, Xiu Li, Yujiu Yang
TL;DR
This work tackles Referring Video Object Segmentation (RVOS) by integrating multiple state-of-the-art RVOS models through a Two-Stage Multi-Model Fusion framework. Stage I combines SOC, MUTR, and ReferFormer with AOT post-processing to produce high-quality masks, while Stage II adds UNINEXT with DeAOT and fuses its output with that of Stage I to improve temporal consistency. The approach achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023). The work offers practical guidance on combining diverse cross-modal architectures with propagation-based refinement in RVOS, and code is available for reproduction.
Abstract
Recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to their superior performance. Most prior works adopt a unified DETR framework to generate segmentation masks in a query-to-instance manner. In this work, we integrate the strengths of these leading RVOS models to build an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of the masks, we propose a Two-Stage Multi-Model Fusion strategy. Each stage ensembles RVOS models according to their framework design and training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence through an object propagation mechanism. Our method achieves 75.7% J&F on the Ref-Youtube-VOS validation set and 70% J&F on the test set, ranking 1st in Track 3 of the 5th Large-scale Video Object Segmentation Challenge (ICCV 2023). Code is available at https://github.com/RobertLuo1/iccv2023_RVOS_Challenge.
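
To make the idea of ensembling binary mask sequences concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the fusion rule, so pixel-wise majority voting is assumed here purely for illustration; the names `majority_vote` and `region_similarity` are hypothetical. The second function computes the region similarity J (intersection over union), one of the two terms averaged in the J&F score; the contour accuracy F is omitted.

```python
from typing import List

Mask = List[List[int]]      # one binary mask: rows of 0/1 pixels
MaskSequence = List[Mask]   # one mask per video frame


def majority_vote(sequences: List[MaskSequence]) -> MaskSequence:
    """Fuse per-model mask sequences by pixel-wise strict majority voting.

    Assumption: every model yields the same number of frames with the same
    resolution. With an even number of models, a tie is resolved to 0.
    """
    if not sequences:
        raise ValueError("need at least one mask sequence")
    n_models = len(sequences)
    fused: MaskSequence = []
    for frame_masks in zip(*sequences):      # masks of one frame, one per model
        fused_frame: Mask = []
        for rows in zip(*frame_masks):       # the same row across all models
            fused_frame.append(
                [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*rows)]
            )
        fused.append(fused_frame)
    return fused


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    pairs = [(p, g) for p_row, g_row in zip(pred, gt) for p, g in zip(p_row, g_row)]
    inter = sum(p & g for p, g in pairs)
    union = sum(p | g for p, g in pairs)
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Three hypothetical models, one 2x2 frame each.
model_a = [[[1, 1], [0, 0]]]
model_b = [[[1, 0], [0, 0]]]
model_c = [[[1, 1], [1, 0]]]
fused = majority_vote([model_a, model_b, model_c])
```

In this toy example a pixel is kept only when at least two of the three models agree on it. In the described method, a VOS model then propagates the fused masks across frames to improve coherence, a step this sketch does not cover.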
