The 2nd Solution for LSVOS Challenge RVOS Track: Spatial-temporal Refinement for Consistent Semantic Segmentation

Tuyen Tran

TL;DR

The paper addresses the Referring Video Object Segmentation (RVOS) challenge with an emphasis on long-term temporal consistency using the MeViS dataset. It introduces a pipeline that combines text-prompted SAM-v2 tracking with a fine-tuned MUTR to generate coarse spatio-temporal masks, followed by a Spatial-Temporal Refinement Module to enforce consistency across time. Empirical results show the approach achieves 60.40 J&F on the MeViS RVOS test set, earning 2nd place at the ECCV 2024 LSVOS Challenge, with ablations highlighting the value of incorporating tracked masklets over coarse predictions. The work demonstrates the practical benefit of integrating external tracking into RVOS pipelines, while also pointing to the ongoing need for end-to-end models that better capture long-term temporal dependencies.

Abstract

Referring Video Object Segmentation (RVOS) is a challenging task because it requires temporal understanding. Because of computational cost, many state-of-the-art models are trained on short time intervals. At test time, these models process information effectively over short time spans but struggle to maintain consistent perception over prolonged sequences, which leads to inconsistencies in the resulting semantic segmentation masks. To address this challenge, we leverage the tracking capabilities of the newly introduced Segment Anything Model version 2 (SAM-v2) to enhance the temporal consistency of the referring object segmentation model. Our method achieved a score of 60.40 $\mathcal{J}\&\mathcal{F}$ on the test set of the MeViS dataset, placing 2nd in the final ranking of the RVOS Track at the ECCV 2024 LSVOS Challenge.

Paper Structure

This paper contains 9 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: We first extract the main noun from the given textual query (e.g., "cat") and use it as input to the Text-Prompted SAM-2 module, which combines Grounding DINO and SAM-v2. Grounding DINO detects the bounding boxes of all instances belonging to the specified object category. These boxes are then used as prompts for SAM-v2, which produces a sequence of spatio-temporal masks. Concurrently, a fine-tuned MUTR model generates coarse masks from the input video. These initial masks are then refined in the Spatial-Temporal Refinement Module, yielding final segmentation masks with improved temporal consistency.
  • Figure 2: Spatial-temporal Refinement Algorithm: We use the coarse prediction masks $\left\{ u^{t}\right\}$ and the tracked masks $\left\{ v_{i}^{t}\right\}$ to construct a component combination $C_{t}$ for each time step $t$. In the example shown, at time steps $1$, $2$, $4$, and $5$, the coarse prediction from the baseline segments only the cat tracked with ID $2$ (yellow mask). At time step $3$, however, both cats are segmented, resulting in $C_{3}=(1,2)$. To maintain temporal consistency, we select the most frequent combination, $(2,)$, and apply it to refine all frames within the window.
  • Figure 3: Qualitative results on the MeViS validation set: We present examples that showcase the effectiveness of the proposed approach. The baseline consists of the initial coarse masks obtained from MUTR. While the baseline produces accurate masks over short periods, it struggles to maintain consistency over longer ones. By using the tracking capabilities of SAM-v2, we refine the initial coarse masks to achieve improved consistency across both the spatial and temporal dimensions.
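
The refinement step described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names (`component_combination`, `refine`), the coverage threshold used to decide whether a tracked mask belongs to the coarse prediction, the use of non-overlapping windows, and the fallback to the coarse mask when no track matches are all assumptions made for illustration. Only the overall idea comes from the caption: build a combination $C_{t}$ of track IDs per frame, pick the most frequent combination in a window, and apply it to every frame in that window.

```python
from collections import Counter

import numpy as np


def component_combination(coarse, tracked, min_cover=0.5):
    """Return the sorted tuple of track IDs whose mask is mostly inside the coarse mask.

    `min_cover` is an assumed matching criterion, not taken from the paper.
    """
    combo = []
    for track_id, mask in sorted(tracked.items()):
        mask = mask.astype(bool)
        area = mask.sum()
        if area == 0:
            continue  # the track is not visible in this frame
        cover = np.logical_and(coarse.astype(bool), mask).sum() / area
        if cover >= min_cover:
            combo.append(track_id)
    return tuple(combo)


def refine(coarse_masks, tracked_masks, window=5, min_cover=0.5):
    """Majority-vote the per-frame combinations inside each window (assumed non-overlapping)."""
    num_frames = len(coarse_masks)
    combos = [
        component_combination(coarse_masks[t], tracked_masks[t], min_cover)
        for t in range(num_frames)
    ]
    refined = []
    for start in range(0, num_frames, window):
        stop = min(start + window, num_frames)
        best, _ = Counter(combos[start:stop]).most_common(1)[0]
        for t in range(start, stop):
            if not best:
                # Assumed fallback: keep the coarse mask when no track matched.
                refined.append(coarse_masks[t].astype(bool))
                continue
            out = np.zeros(coarse_masks[t].shape, dtype=bool)
            for track_id in best:
                if track_id in tracked_masks[t]:
                    out |= tracked_masks[t][track_id].astype(bool)
            refined.append(out)
    return combos, refined


# Toy version of the Figure 2 example: two cats, five frames.
H, W = 4, 8
cat1 = np.zeros((H, W), dtype=bool)
cat1[1:3, 0:3] = True
cat2 = np.zeros((H, W), dtype=bool)
cat2[1:3, 5:8] = True

tracked = [{1: cat1, 2: cat2} for _ in range(5)]
coarse = [cat2.copy() for _ in range(5)]
coarse[2] = cat1 | cat2  # the third frame wrongly segments both cats

combos, refined = refine(coarse, tracked, window=5)
print(combos)  # [(2,), (2,), (1, 2), (2,), (2,)]
```

In this toy run the most frequent combination is `(2,)`, so every refined frame, including the third, contains only the mask of track 2.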