LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS

Xinyu Liu; Jing Zhang; Kexin Zhang; Xu Liu; Lingling Li

TL;DR

The paper tackles robust Video Object Segmentation (VOS) in challenging, long-horizon videos where objects are occluded, disappear, and reappear. It combines SAM2's memory-based segmentation with the semi-supervised Cutie framework, relying on a memory module, object queries, and high-resolution features to keep masks accurate over time. Key contributions include a detailed inference configuration, memory-management strategies, and an evaluation on the LSVOS challenge VOS track, where the method achieves a test-phase J&F of $0.7952$ (the mean of the two component scores, $0.7516$ and $0.8388$), ranking third. The approach shows improved robustness to occlusion, small targets, and distractors, which points to practical potential for reliable VOS in real-world scenarios.

Abstract

Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the disappearance and reappearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J&F score of 0.7952 in the testing phase of the LSVOS challenge VOS track, ranking third overall.
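
The J&F score quoted above is conventionally the arithmetic mean of region similarity J (the intersection over union of predicted and ground-truth masks) and contour accuracy F. The following is a minimal sketch of that averaging, not the challenge's evaluation code: the function names `jaccard` and `j_and_f` and the toy masks are hypothetical, and F is taken as a given number rather than computed from boundaries. It also checks that the two component values listed on this page, 0.7516 and 0.8388, average to the reported 0.7952.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks.

    Masks are equally sized nested lists of 0/1 values (illustrative format).
    """
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
iou = jaccard(pred, gt)

# The mean is symmetric, so the order of the two component scores is irrelevant.
combined = j_and_f(0.7516, 0.8388)
```

Because J&F is a plain average, a method can reach the same combined score with quite different balances between mask overlap and boundary quality, which is why the component scores are usually reported alongside it.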

Paper Structure

This paper contains 6 sections, 5 equations, 4 figures, and 1 table.

Introduction
Method
Experiment
Inference
Evaluation Metrics
Conclusion

Figures (4)

Figure 1: An overview of the VOS framework. The figure illustrates the key components of our approach, including the memory-based paradigm, pixel-level matching, and object query mechanism.
Figure 2: The SAM 2 architecture
Figure 3: The Cutie architecture
Figure 4: Performance on sequences with small targets.
