FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma

TL;DR

The paper addresses semi-supervised Video Object Segmentation in complex MOSE scenes by fine-tuning a strong Transformer-based backbone (SAM 2-Large) on MOSE, augmenting predictions with a dilation-based morphological post-processing step to reduce gaps between adjacent objects, and applying multi-scale test-time fusion via voting. The proposed FVOS framework comprises three components: MOSE-specific fine-tuning, morphological post-processing, and multi-scale prediction fusion, with a two-stage training scheme yielding robust single-model results. On MOSE, FVOS achieves 76.81% J&F on the validation set and 83.92% J&F on the test set, securing third place on the challenge leaderboard, while demonstrating the value of dataset-tailored optimization and post-processing strategies for challenging VOS scenarios. These findings suggest that combining targeted fine-tuning with structural mask refinements and multi-scale ensemble techniques can substantially improve performance in real-world VOS applications such as autonomous navigation and video editing.

Abstract

Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.
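The two post-training steps named in the abstract, morphological post-processing and voting-based fusion, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `dilate`, `fill_gaps`, and `vote` are hypothetical, the square structuring element and the rule that dilated pixels may only claim background are assumptions, and the label maps passed to `vote` are assumed to have already been resized to a common resolution.

```python
from typing import Sequence

import numpy as np


def dilate(mask: np.ndarray, k: int) -> np.ndarray:
    """Binary dilation of a 2-D boolean mask with a k x k square element.

    k <= 1 is treated as a no-op (assumption: mirrors a 'kernel=0' setting).
    """
    if k <= 1:
        return mask.copy()
    before = k // 2
    after = k - 1 - before
    padded = np.pad(mask, ((before, after), (before, after)),
                    constant_values=False)
    h, w = mask.shape
    out = np.zeros_like(mask)
    # OR together every shifted copy of the mask covered by the kernel.
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def fill_gaps(labels: np.ndarray, k: int) -> np.ndarray:
    """Grow each object into neighbouring background pixels only.

    labels: 2-D integer map, 0 = background, 1..N = object ids.
    Assumption: objects never overwrite each other; when two objects
    could claim the same background pixel, the lower id wins.
    """
    out = labels.copy()
    for obj_id in np.unique(labels):
        if obj_id == 0:
            continue
        grown = dilate(labels == obj_id, k)
        out[grown & (out == 0)] = obj_id
    return out


def vote(label_maps: Sequence[np.ndarray], num_ids: int) -> np.ndarray:
    """Per-pixel majority vote over label maps of identical shape.

    Ties resolve to the lowest id (background first), an arbitrary choice.
    """
    stacked = np.stack(label_maps)                      # (N, H, W)
    counts = np.stack([(stacked == i).sum(axis=0)       # (num_ids, H, W)
                       for i in range(num_ids)])
    return counts.argmax(axis=0).astype(label_maps[0].dtype)


if __name__ == "__main__":
    row = np.array([[1, 1, 0, 2, 2]], dtype=np.uint8)
    print(fill_gaps(row, 3))   # the one-pixel gap is absorbed by object 1
    preds = [np.array([[1, 1, 0]]), np.array([[1, 2, 0]]), np.array([[1, 2, 2]])]
    print(vote(preds, num_ids=3))
```

Restricting dilation to background pixels keeps the original object boundaries intact, so a larger kernel closes wider gaps without letting one object erode another. Whether the paper applies dilation before or after fusion is not stated here, so the two functions are kept independent.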

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Network Architecture of our FVOS. It mainly consists of a query encoder, a memory encoder, a mask decoder and attention Transformer blocks.
  • Figure 2: Morphological post-processing results on the MOSE test dataset. (a) kernel=0. (b) kernel=2. (c) kernel=3. (d) kernel=5.
  • Figure 3: Test-time data augmentation and multi-scale magnification operations. (a) Original image. (b) Rotated clockwise by 90$^\circ$. (c) Rotated clockwise by 180$^\circ$. (d) Rotated clockwise by 270$^\circ$. (e) Horizontal flip. (f) Multi-scale magnification.
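
The operations listed for Figure 3 can be sketched as pairs of forward and inverse transforms: each augmented image is segmented, and the predicted mask is mapped back to the original orientation and size before fusion. The sketch below is an assumption-laden illustration rather than the authors' pipeline. `segment` is a hypothetical callable standing in for the fine-tuned model, `magnify` and `shrink` use an integer nearest-neighbour factor of 2, and the actual scales used in the paper are not given in this section.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np

Transform = Callable[[np.ndarray], np.ndarray]


def magnify(a: np.ndarray, s: int) -> np.ndarray:
    """Nearest-neighbour upscaling by an integer factor s (assumed scheme)."""
    return np.repeat(np.repeat(a, s, axis=0), s, axis=1)


def shrink(a: np.ndarray, s: int) -> np.ndarray:
    """Inverse of magnify: keep every s-th pixel along both spatial axes."""
    return a[::s, ::s]


# name -> (forward transform for the image, inverse transform for the mask).
# np.rot90 rotates counter-clockwise for positive k, so clockwise uses k < 0.
TTA: Dict[str, Tuple[Transform, Transform]] = {
    "identity":   (lambda a: a,               lambda a: a),
    "rot90_cw":   (lambda a: np.rot90(a, -1), lambda a: np.rot90(a, 1)),
    "rot180_cw":  (lambda a: np.rot90(a, -2), lambda a: np.rot90(a, 2)),
    "rot270_cw":  (lambda a: np.rot90(a, -3), lambda a: np.rot90(a, 3)),
    "hflip":      (np.fliplr,                 np.fliplr),
    "magnify_x2": (lambda a: magnify(a, 2),   lambda a: shrink(a, 2)),
}


def predict_with_tta(image: np.ndarray, segment: Transform) -> List[np.ndarray]:
    """Run `segment` on every augmented view and undo each transform.

    Returns one label map per view, all aligned with the original image,
    ready to be combined by a per-pixel vote.
    """
    aligned = []
    for forward, inverse in TTA.values():
        mask = segment(forward(image))
        aligned.append(inverse(mask))
    return aligned


if __name__ == "__main__":
    img = np.arange(6).reshape(2, 3)
    # Toy stand-in for a model: a pixel-wise threshold.
    toy_segment = lambda a: (a > 2).astype(np.uint8)
    for name, m in zip(TTA, predict_with_tta(img, toy_segment)):
        print(name, m.tolist())
```

With a pixel-wise toy model every view yields the same mask after inversion; a real segmentation network is not exactly equivariant to these transforms, which is what makes fusing the views informative.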