3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation
Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang
TL;DR
The paper tackles pixel-level video object segmentation under complex occlusion scenarios by extending the Cutie memory-based VOS framework. It emphasizes object-centric memory management, including a compact object memory and an object transformer, within a three-stage pipeline of image segmentation, tracking, and refinement. Through experiments on the MOSE dataset, it analyzes the impact of memory size, frame sampling rate, and input resolution, reporting a J&F score of 0.8139 and a third-place finish without extra training. The work demonstrates the robustness of memory-driven, high-level semantics in challenging VOS tasks and offers practical guidelines for memory management and inference strategies in real-time settings.
Abstract
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames. Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance. This report validates the effectiveness of our inference method on the coMplex video Object SEgmentation (MOSE) dataset, which features complex occlusions. Our experimental results demonstrate that our approach achieves a J&F score of 0.8139 on the test set, securing the third position in the final ranking. These findings highlight the robustness and accuracy of our method in handling challenging VOS scenarios.
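For readers unfamiliar with the reported metric, J&F is the mean of region similarity J (the intersection-over-union of predicted and ground-truth masks) and contour accuracy F (an F-measure over boundary pixels). The following is a minimal single-frame, single-object sketch and not the official evaluation code. The function names, the 4-neighbourhood boundary extraction, and the fixed pixel tolerance `tol` are assumptions made for illustration; benchmark toolkits match boundaries more carefully and average the scores over objects and frames.

```python
import numpy as np


def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return float(np.logical_and(pred, gt).sum() / union)


def _dilate(mask):
    """Grow a boolean mask by one pixel in the 4-neighbourhood."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out


def _boundary(mask):
    """Mask pixels that have at least one background 4-neighbour."""
    eroded = ~_dilate(~mask)
    return mask & ~eroded


def contour_accuracy(pred, gt, tol=1):
    """F: boundary F-measure with a simplified pixel tolerance (assumption)."""
    pred_b = _boundary(pred.astype(bool))
    gt_b = _boundary(gt.astype(bool))
    if pred_b.sum() == 0 and gt_b.sum() == 0:
        return 1.0
    pred_d, gt_d = pred_b, gt_b
    for _ in range(tol):
        pred_d, gt_d = _dilate(pred_d), _dilate(gt_d)
    # A boundary pixel counts as matched if it lies within `tol` of the other boundary.
    precision = (pred_b & gt_d).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & pred_d).sum() / max(gt_b.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


def j_and_f(pred, gt, tol=1):
    """Mean of region similarity J and contour accuracy F."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt, tol))


if __name__ == "__main__":
    gt = np.zeros((8, 8), dtype=bool)
    gt[2:6, 2:6] = True          # 4x4 ground-truth square
    pred = np.zeros_like(gt)
    pred[2:6, 2:4] = True        # prediction covers only the left half
    print("J   =", region_similarity(pred, gt))
    print("J&F =", j_and_f(pred, gt))
```

In this toy case the prediction covers half of the ground-truth square, so J is 0.5. The combined score also depends on how much of the boundary is recovered within the tolerance.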
