LSVOS Challenge 3rd Place Report: SAM2 and Cutie based VOS

Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, Lingling Li

TL;DR

The paper tackles robust Video Object Segmentation (VOS) in challenging, long videos in which objects are occluded, disappear, and reappear. It combines SAM2's memory-based segmentation with the semi-supervised Cutie framework, using a memory module, object queries, and high-resolution features to maintain accurate masks over time. Key contributions include a detailed inference configuration, memory-management strategies, and results on the LSVOS challenge VOS track: a test-phase J&F of $0.7952$ (the mean of its two component scores, $0.7516$ and $0.8388$), ranking third. The approach shows improved robustness to occlusion, small targets, and distractors, which suggests practical potential for reliable VOS in real-world scenarios.

Abstract

Video Object Segmentation (VOS) presents several challenges, including object occlusion and fragmentation, the disappearance and reappearance of objects, and tracking specific objects within crowded scenes. In this work, we combine the strengths of the state-of-the-art (SOTA) models SAM2 and Cutie to address these challenges. Additionally, we explore the impact of various hyperparameters on video instance segmentation performance. Our approach achieves a J&F score of 0.7952 in the testing phase of the LSVOS challenge VOS track, ranking third overall.
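
For readers unfamiliar with the metric, the sketch below illustrates how a J&F score is conventionally formed in VOS benchmarks: J is the region similarity (intersection over union of the predicted and ground-truth masks), F is the contour accuracy, and J&F is their arithmetic mean. The function names, the toy masks, and the empty-mask convention are assumptions made for illustration; they are not taken from the report or from the challenge's evaluation code, and F is passed in as a number rather than computed.

```python
def region_similarity(pred, gt):
    """Jaccard index J: intersection over union of two binary masks.

    Masks are given as equally sized lists of rows containing 0/1 values.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy 3x3 masks (hypothetical data): 3 pixels overlap, 5 pixels in the union.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 1, 0],
      [0, 1, 0]]
j = region_similarity(pred, gt)

# The two component values quoted alongside the headline score in the TL;DR.
headline = round(j_and_f(0.7516, 0.8388), 4)
```

With the toy masks, `j` evaluates to 3/5, and averaging the two component values quoted in the TL;DR reproduces the 0.7952 stated in the abstract.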

Paper Structure

This paper contains 6 sections, 5 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An overview of the VOS framework. The figure illustrates the key components of our approach, including the memory-based paradigm, pixel-level matching, and object query mechanism.
  • Figure 2: The SAM 2 architecture.
  • Figure 3: The Cutie architecture.
  • Figure 4: Performance on sequences with small targets.