2nd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

Zhensong Xu; Jiangtao Yao; Chengjing Wu; Ting Liu; Luoqi Liu

TL;DR

This work addresses complex video object segmentation on MOSE within PVUW 2024 by extending the memory-based Cutie model with data augmentation and inference-time enhancements. It enriches the training data with Mask2Former-generated MOSE masks and COCO-derived binary masks, and adds motion blur during training to improve robustness to blur and to small, similar objects. At inference, test-time augmentation and a carefully tuned memory strategy further boost performance, yielding $\mathcal{J}=0.8007$, $\mathcal{F}=0.8683$, and $\mathcal{J}\&\mathcal{F}=0.8345$ on MOSE (2nd place). The approach demonstrates that targeted data augmentation and memory-aware inference can improve semi-supervised VOS under challenging MOSE conditions, with practical implications for video editing and annotation pipelines.

Abstract

Complex video object segmentation serves as a fundamental task for a wide range of downstream applications such as video editing and automatic data annotation. Here we present the 2nd place solution in the MOSE track of PVUW 2024. To mitigate problems caused by tiny objects, similar objects and fast movements in MOSE, we use instance segmentation to generate extra pretraining data from the validation and test sets of MOSE. The segmented instances are combined with objects extracted from COCO to augment the training data and enhance the semantic representation of the baseline model. In addition, motion blur is added during training to increase robustness against image blur induced by motion. Finally, we apply test-time augmentation (TTA) and a memory strategy at the inference stage. Our method ranked 2nd in the MOSE track of PVUW 2024, with a $\mathcal{J}$ of 0.8007, an $\mathcal{F}$ of 0.8683 and a $\mathcal{J}\&\mathcal{F}$ of 0.8345.
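As a quick consistency check on the reported scores, the sketch below assumes the usual VOS convention that $\mathcal{J}$ is the region similarity (intersection over union of the predicted and ground-truth masks) and that $\mathcal{J}\&\mathcal{F}$ is the arithmetic mean of $\mathcal{J}$ and $\mathcal{F}$. The function names and the toy masks are illustrative and are not taken from the challenge evaluation code; the boundary measure $\mathcal{F}$ is not reimplemented here.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """Mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2.0


# Toy 2x3 masks (hypothetical data): 2 pixels overlap, 4 pixels in the union.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]
toy_j = region_similarity(pred_mask, gt_mask)

# Scores reported in the abstract.
reported_jf = j_and_f(0.8007, 0.8683)
print(f"toy J = {toy_j:.2f}, reported J&F = {reported_jf:.4f}")
```

Under these assumptions, averaging the reported $\mathcal{J}$ and $\mathcal{F}$ reproduces the reported $\mathcal{J}\&\mathcal{F}$ of 0.8345.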

Paper Structure (11 sections, 4 figures, 2 tables)

Introduction
Method
Baseline model
Data augmentation
Inference time operations
TTA
Experiment
Implementation details
Results in the 1st MOSE challenge
Ablation study
Conclusion

Figures (4)

Figure 1: Overview of our method.
Figure 2: Architecture of Cutie.
Figure 3: Examples of generated pretraining data and motion blur. Left: binary mask generated from the validation and test sets of MOSE. Middle: binary mask generated from COCO, where the masks of different classes are merged into one mask. Right: example of motion blur in the horizontal direction.
Figure 4: Qualitative results on the test set of MOSE.
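
The two operations illustrated in Figure 3 (merging per-class masks into one binary mask, and blurring in the horizontal direction) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the box-shaped blur kernel, the kernel size and the edge-replication padding are all hypothetical choices, since the summary does not specify them.

```python
def merge_masks(instance_masks):
    """Union several binary masks (nested lists of 0/1) into one binary mask."""
    height = len(instance_masks[0])
    width = len(instance_masks[0][0])
    return [
        [1 if any(mask[y][x] for mask in instance_masks) else 0
         for x in range(width)]
        for y in range(height)
    ]


def horizontal_motion_blur(image, kernel_size=5):
    """Average each pixel with its horizontal neighbours.

    Assumption: a uniform 1 x kernel_size kernel with replicated edges.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            total = 0.0
            for dx in range(-half, half + 1):
                xi = min(max(x + dx, 0), width - 1)  # clamp at the borders
                total += row[xi]
            new_row.append(total / kernel_size)
        blurred.append(new_row)
    return blurred


# Hypothetical toy inputs: two instance masks and a one-row grayscale image.
mask_a = [[1, 0], [0, 0]]
mask_b = [[0, 0], [0, 1]]
merged = merge_masks([mask_a, mask_b])

image = [[0, 0, 255, 0, 0]]
blurred = horizontal_motion_blur(image, kernel_size=3)
print(merged, blurred)
```

With the 3-pixel kernel, the single bright pixel in the toy row is smeared across its two horizontal neighbours, which mimics the streaking caused by fast lateral motion.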
