The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation
Tuyen Tran
TL;DR
The paper addresses the Referring Video Object Segmentation (RVOS) challenge with an emphasis on long-term temporal consistency using the MeViS dataset. It introduces a pipeline that combines text-prompted SAM-v2 tracking with a fine-tuned MUTR to generate coarse spatio-temporal masks, followed by a Spatial-Temporal Refinement Module to enforce consistency across time. Empirical results show the approach achieves 60.40 J&F on the MeViS RVOS test set, earning 2nd place at the ECCV 2024 LSVOS Challenge, with ablations highlighting the value of incorporating tracked masklets over coarse predictions. The work demonstrates the practical benefit of integrating external tracking into RVOS pipelines, while also pointing to the ongoing need for end-to-end models that better capture long-term temporal dependencies.
Abstract
Referring Video Object Segmentation (RVOS) is a challenging task because it requires temporal understanding. Because of the computational cost, many state-of-the-art models are trained on short time intervals. At test time, these models process information effectively over short time steps but struggle to maintain consistent perception over prolonged sequences, which leads to inconsistencies in the resulting semantic segmentation masks. To address this challenge, we leverage the tracking capabilities of the newly introduced Segment Anything Model version 2 (SAM-v2) to enhance the temporal consistency of the referring object segmentation model. Our method achieved a score of 60.40 $\mathcal{J}\&\mathcal{F}$ on the test set of the MeViS dataset, placing 2nd in the final ranking of the RVOS Track at the ECCV 2024 LSVOS Challenge.
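For readers unfamiliar with the reported metric, the following is a minimal sketch of how a J&F score is conventionally assembled in video object segmentation benchmarks: J is the region similarity (Jaccard index, or IoU) between predicted and ground-truth masks, F is a contour-accuracy F-measure, and J&F is the average of the two. The sketch assumes binary masks given as nested lists and takes per-frame F scores as precomputed inputs, since contour matching is not implemented here. The names `region_similarity` and `j_and_f` are hypothetical, and the official challenge evaluation may aggregate over objects and videos differently.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask: rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (IoU) between two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_scores: Sequence[float], f_scores: Sequence[float]) -> float:
    """Average of mean J and mean F over the frames of one video.

    f_scores are assumed to be precomputed contour-accuracy values;
    this sketch does not implement boundary matching.
    """
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return (mean_j + mean_f) / 2.0
```

Because J is computed per frame, a model whose masks drift or switch objects late in a long clip loses score on every affected frame, which is why temporal consistency has a direct effect on the reported number.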
