MASSeg : 2nd Technical Report for 4th PVUW MOSE Track

Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Puhua Chen, Wenping Ma

TL;DR

The paper tackles robust pixel-level video object segmentation (VOS) under challenging MOSE conditions such as occlusion and small object instances. It introduces MASSeg, a transformer-based VOS method built on the SAM2 framework, augmented with a mask output scaling mechanism and trained on the enhanced MOSE+ dataset. Its key components are a multi-task loss $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{CE} + \lambda_2 \mathcal{L}_{Dice} + \lambda_3 \mathcal{L}_{Sim} + \lambda_4 \mathcal{L}_{MaskIoU}$, inter-frame consistent and inconsistent data augmentations, and an inference-time mask confidence control strategy. Ablations show progressive gains from these components, leading to a final J&F of 0.8628 on the MOSE test set. The approach placed 2nd in the MOSE track of the CVPR 2025 PVUW Challenge, demonstrating improved robustness and generalization for real-world VOS tasks.
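As a rough illustration of how a weighted multi-task loss of this form is assembled, the sketch below combines per-term losses with weights $\lambda_k$. It is not the authors' implementation: the binary cross-entropy and Dice definitions are the common textbook forms, the function names and weight values are hypothetical, and the Sim and MaskIoU terms are not defined in this summary, so they are passed in as already-computed placeholder numbers.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened per-pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def total_loss(terms, weights):
    """Weighted sum over named loss terms: sum_k lambda_k * L_k."""
    if terms.keys() != weights.keys():
        raise ValueError("terms and weights must share the same keys")
    return sum(weights[name] * terms[name] for name in terms)


# Toy example on four pixels. The weights are illustrative only, and the
# "sim" and "mask_iou" values are placeholders, not outputs of real losses.
probs = [0.9, 0.8, 0.2, 0.1]
targets = [1, 1, 0, 0]
terms = {
    "ce": bce_loss(probs, targets),
    "dice": dice_loss(probs, targets),
    "sim": 0.0,
    "mask_iou": 0.0,
}
weights = {"ce": 1.0, "dice": 1.0, "sim": 1.0, "mask_iou": 1.0}
loss = total_loss(terms, weights)
```

Keeping the terms in a dictionary makes it easy to switch a component off by setting its weight to zero, which mirrors how component-wise ablations are usually run.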

Abstract

Complex video object segmentation continues to face significant challenges in small object recognition, occlusion handling, and dynamic scene modeling. This report presents our solution, which ranked second in the MOSE track of the CVPR 2025 PVUW Challenge. Building on an existing segmentation framework, we propose an improved model, MASSeg, for complex video object segmentation, and construct an enhanced dataset, MOSE+, which includes typical scenarios with occlusions, cluttered backgrounds, and small target instances. During training, we combine inter-frame consistent and inconsistent data augmentation strategies to improve robustness and generalization. During inference, we design a mask output scaling strategy to better adapt to varying object sizes and occlusion levels. As a result, MASSeg achieves a J score of 0.8250, an F score of 0.9007, and a J&F score of 0.8628 on the MOSE test set.
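For readers unfamiliar with the metrics, J is the region similarity (intersection over union between predicted and ground-truth masks), F is the contour accuracy, and J&F is their arithmetic mean. The minimal sketch below shows J on toy binary masks and checks the reported numbers; the function names are illustrative, and returning 1.0 when both masks are empty is an assumed convention here.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


# Scores reported in the abstract.
reported_mean = j_and_f(0.8250, 0.9007)
```

Averaging the two reported scores gives approximately 0.86285, which agrees with the reported J&F of 0.8628 up to rounding of the underlying values.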

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Representative challenges in complex video object segmentation. The examples showcase small object motion (left), appearance confusion with occlusion (middle), and densely cluttered scenes (right), which reflect typical situations encountered in the MOSE dataset.
  • Figure 2: Overview of our method.
  • Figure 3: Visualization of our data augmentation strategies. Each column shows an example before (top) and after (bottom) applying augmentation. From left to right: (a) geometric transformation (e.g., affine distortion), (b) color jittering with inconsistent color shift, and (c) grayscale conversion. These augmentations simulate realistic variations in pose, illumination, and appearance, improving robustness and generalization of the model in complex scenarios.
  • Figure 4: Qualitative results of our method on challenging MOSE test sequences. Our model accurately segments small objects, handles severe occlusions, and maintains temporal consistency across fast-moving and cluttered scenes.
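The difference between inter-frame consistent and inconsistent augmentation, mentioned in the abstract and in Figure 3, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: brightness scaling stands in for the color jitter, frames are toy nested lists of intensities, and the function names and the sampling range `[0.8, 1.2]` are hypothetical.

```python
import random


def jitter_brightness(frame, factor):
    """Scale every pixel intensity by `factor`, clamped to [0, 255]."""
    return [[min(255, max(0, round(v * factor))) for v in row] for row in frame]


def augment_clip(frames, consistent, rng, low=0.8, high=1.2):
    """Apply brightness jitter to every frame of a clip.

    consistent=True:  one factor is sampled per clip and shared by all frames.
    consistent=False: a fresh factor is sampled for each frame.
    """
    shared = rng.uniform(low, high)
    out = []
    for frame in frames:
        factor = shared if consistent else rng.uniform(low, high)
        out.append(jitter_brightness(frame, factor))
    return out


# Toy clip: three identical 1x2 frames.
clip = [[[100, 200]] for _ in range(3)]
same_shift = augment_clip(clip, consistent=True, rng=random.Random(0))
varied_shift = augment_clip(clip, consistent=False, rng=random.Random(0))
```

With `consistent=True` the three output frames stay identical to one another, so appearance is stable over time; with `consistent=False` each frame receives its own shift, which exposes the model to frame-to-frame appearance changes.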