3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang

TL;DR

The paper tackles pixel-level video object segmentation under complex occlusion scenarios by extending the Cutie memory-based VOS framework. It emphasizes object-centric memory management, including a compact object memory and an object transformer, within a three-stage pipeline of image segmentation, tracking, and refinement. Through experiments on the MOSE dataset, it analyzes the impact of memory size, frame sampling rate, and input resolution, reporting a J&F score of 0.8139 and a third-place finish without extra training. The work demonstrates the robustness of memory-driven, high-level semantics in challenging VOS tasks and offers practical guidelines for memory management and inference strategies in real-time settings.
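
To make the memory size and frame sampling rate knobs concrete, here is a minimal sketch of a bounded frame memory. It is an illustration under stated assumptions, not Cutie's or the authors' implementation: the class `FrameMemory`, its parameters `max_frames` and `sample_every`, and the policy of keeping the first frame permanently while evicting the oldest sampled frame are all hypothetical choices made for this example.

```python
from collections import deque


class FrameMemory:
    """Hypothetical bounded frame memory (illustrative only)."""

    def __init__(self, max_frames, sample_every):
        if max_frames < 1 or sample_every < 1:
            raise ValueError("max_frames and sample_every must be >= 1")
        self.sample_every = sample_every
        # Assumption: the first frame is kept for the whole video.
        self.permanent = None
        # The remaining slots hold sampled frames; the oldest is evicted first.
        self.recent = deque(maxlen=max_frames - 1)

    def maybe_add(self, frame_idx, feature):
        """Store the frame if it is the first one or falls on the sampling grid."""
        if self.permanent is None:
            self.permanent = (frame_idx, feature)
            return True
        if frame_idx % self.sample_every == 0:
            self.recent.append((frame_idx, feature))
            return True
        return False

    def frame_indices(self):
        """Return the indices of all frames currently held in memory."""
        held = [] if self.permanent is None else [self.permanent[0]]
        held.extend(idx for idx, _ in self.recent)
        return held


if __name__ == "__main__":
    memory = FrameMemory(max_frames=3, sample_every=5)
    for t in range(21):
        # A string stands in for the per-frame feature tensor.
        memory.maybe_add(t, feature=f"feat_{t}")
    print(memory.frame_indices())
```

In this sketch, a larger `max_frames` keeps more history at a higher memory cost, and a smaller `sample_every` writes to memory more often. These are the two quantities the report varies alongside input resolution.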

Abstract

Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames. Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance. This report validates the effectiveness of our inference method on the coMplex video Object SEgmentation (MOSE) dataset, which features complex occlusions. Our experimental results demonstrate that our approach achieves a J&F score of 0.8139 on the test set, securing the third position in the final ranking. These findings highlight the robustness and accuracy of our method in handling challenging VOS scenarios.
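
For readers unfamiliar with the metric, J&F is conventionally the average of region similarity J (the Jaccard index, or intersection over union, between predicted and ground-truth masks) and contour accuracy F (a boundary F-measure). The sketch below computes J for binary masks and combines it with F. It makes several assumptions: the function names `jaccard` and `j_and_f` are hypothetical, masks are plain nested lists of 0/1, F values are taken as given because the boundary-matching step is omitted, and two empty masks are scored as J = 1. It is not the official evaluation code.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j_scores, f_scores):
    """Average of mean J and mean F over the evaluated objects."""
    if not j_scores or not f_scores:
        raise ValueError("score lists must be non-empty")
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return (mean_j + mean_f) / 2


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [1, 0]]
    print(jaccard(pred, gt))
    # The F values here are made-up inputs, not results from the report.
    print(j_and_f([0.8, 0.6], [0.9, 0.7]))
```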

Paper Structure (10 sections, 5 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: VOS framework overview. It consists of three independent components: a segmenter, a referring tracker, and a temporal refiner.
  • Figure 2: The framework of Cutie.