LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS
Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, Lingling Li
TL;DR
The paper tackles robust Video Object Segmentation (VOS) in challenging, long-horizon videos with occlusion and reappearance. It combines SAM2's memory-based segmentation with the semi-supervised Cutie framework, leveraging a memory module, object queries, and high-resolution features to maintain accurate masks over time. Key contributions include a detailed inference configuration, memory-management strategies, and an evaluation on the LSVOS challenge VOS track, where the method achieves a test-phase J&F of $0.7952$ (the mean of the two component scores $0.7516$ and $0.8388$), ranking third. The approach shows improved robustness to occlusion, small targets, and distractors, highlighting practical potential for reliable VOS in real-world scenarios.
Abstract
Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the disappearance and reappearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J&F score of 0.7952 in the test phase of the LSVOS challenge VOS track, ranking third overall.
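For readers unfamiliar with the reported metric, the following is a minimal sketch of how a J&F score is typically formed: J is the region similarity (Jaccard index, i.e. intersection over union of predicted and ground-truth masks), F is the contour accuracy, and J&F is their arithmetic mean. The function names `region_similarity` and `j_and_f` are hypothetical and not taken from the challenge toolkit. The contour term F, which requires boundary matching, is omitted here, and the handling of two empty masks is an assumed convention.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy 2x2 masks: one overlapping pixel, two pixels in the union.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
toy_j = region_similarity(pred_mask, gt_mask)

# The mean is symmetric, so the order of the two component scores does not matter.
reported_mean = j_and_f(0.7516, 0.8388)
```

Because J&F is a plain average, the two component scores quoted alongside the result, 0.7516 and 0.8388, average to the reported 0.7952.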
