1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation

Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang

TL;DR

The paper addresses complex video object segmentation under occlusion and object-part fragmentation, using MOSE as the benchmark. It introduces a semantic embedding approach that uses a pretrained ViT to obtain semantic priors, together with a discriminative query mechanism that maintains robust target representations across frames. The core contributions are the Fusion Block, which fuses ViT-derived semantic priors with multi-scale features via cross-attention and deformable attention, and the Discriminative Query Generation module, which updates target queries using the most discriminative feature channels. Trained on the large-scale MEGA dataset and evaluated on the PVUW 2024 Complex Track, the method achieves the top score of $84.45\%$ on the joint metric $\mathcal{J}\&\mathcal{F}$, with improved robustness for small and visually similar targets in realistic scenes.
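For readers unfamiliar with the joint metric, $\mathcal{J}\&\mathcal{F}$ is conventionally the mean of region similarity $\mathcal{J}$ (the Jaccard index, or IoU, between predicted and ground-truth masks) and contour accuracy $\mathcal{F}$ (a boundary F-measure). The sketch below is only an illustration of that definition, not the challenge's evaluation code: the function names `region_similarity` and `j_and_f` are hypothetical, the boundary score is taken as a given input rather than computed, and the handling of two empty masks is an assumed convention.

```python
import numpy as np


def region_similarity(pred, gt):
    """Jaccard index J (intersection over union) of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)


def j_and_f(j_mean, f_mean):
    """Joint score: arithmetic mean of region (J) and boundary (F) scores."""
    return 0.5 * (j_mean + f_mean)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]  # toy 2x2 predicted mask
    gt = [[1, 0], [0, 0]]    # toy 2x2 ground-truth mask
    j = region_similarity(pred, gt)
    # The F value here is a made-up placeholder, not a result from the paper.
    print(j, j_and_f(j, 0.7))
```

In practice both scores are averaged over all objects and frames before being combined, so a single reported number such as the one above summarizes mask overlap and boundary quality together.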

Abstract

Tracking and segmenting multiple objects in complex scenes has long been a challenge in video object segmentation, especially when objects are occluded or split into parts. In such cases, the definition of an object becomes ambiguous. The MOSE dataset is motivated by the need to clearly recognize and distinguish objects in complex scenes. In this challenge, we propose a semantic embedding video object segmentation model and use the salient features of objects as query representations. Semantic understanding helps the model recognize parts of objects, while the salient features capture the most discriminative characteristics of each object. Trained on a large-scale video object segmentation dataset, our model achieves first place (84.45%) on the test set of the PVUW Challenge 2024: Complex Video Object Segmentation Track.

Paper Structure (9 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Visual examples from the MOSE test set. Left: the first frame; right: the corresponding targets.
  • Figure 2: Overall framework of our methods.
  • Figure 3: Performance on sequences with small targets.
  • Figure 4: Qualitative results on complex sequences.