CSS-Segment: 2nd Place Report of LSVOS Challenge VOS Track
Jinming Chai, Qin Ma, Junpei Zhang, Licheng Jiao, Fang Liu
TL;DR
CSS-Segment targets video object segmentation in long-term LVOS scenarios by fusing strengths from Cutie, SAM, and SAM2 into a streaming, memory-aware architecture. It combines a streaming Image Encoder, a SAM-inspired Mask Encoder, an Object Transformer with an Object Memory, and memory-based reading to maintain object-centric segmentation across long sequences. The approach emphasizes cross-module memory reading, multi-scale inference, and fusion strategies to boost robustness against motion, occlusion, and reappearance. Empirically, CSS-Segment achieves a J&F score of 80.84 and ranks 2nd in the ECCV 2024 LSVOS VOS Track, highlighting the effectiveness of object-level memory and cross-module integration for challenging video segmentation tasks.
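As a rough illustration of the multi-scale inference and fusion idea mentioned above, the sketch below averages per-pixel foreground probabilities predicted at several input scales and thresholds the result. It assumes the maps have already been resized to a common resolution. The function names, the optional weights, and the 0.5 threshold are hypothetical choices for illustration and are not taken from the report, which does not specify its exact fusion rule here.

```python
from typing import List, Optional, Sequence

# H x W grid of foreground probabilities in [0, 1] (hypothetical representation).
ProbMap = List[List[float]]


def fuse_probability_maps(
    maps: Sequence[ProbMap],
    weights: Optional[Sequence[float]] = None,
) -> ProbMap:
    """Weighted per-pixel average of same-sized probability maps."""
    if not maps:
        raise ValueError("need at least one probability map")
    if weights is None:
        weights = [1.0] * len(maps)  # plain average by default
    if len(weights) != len(maps):
        raise ValueError("one weight per map is required")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all maps must share the same resolution")

    return [
        [
            sum(w * m[y][x] for w, m in zip(weights, maps)) / total
            for x in range(width)
        ]
        for y in range(height)
    ]


def binarize(prob_map: ProbMap, threshold: float = 0.5) -> List[List[int]]:
    """Turn a fused probability map into a binary object mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


# Two toy 1x2 predictions of the same frame, e.g. from two input scales.
scale_a = [[0.9, 0.2]]
scale_b = [[0.7, 0.6]]
fused = fuse_probability_maps([scale_a, scale_b])
mask = binarize(fused)  # first pixel is kept, second is rejected
```

Averaging probabilities rather than binary masks lets a confident prediction at one scale outweigh a marginal one at another, which is one common motivation for this kind of fusion.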
Abstract
Video object segmentation is a challenging task that underpins numerous downstream applications, including video editing and autonomous driving. In this technical report, we briefly introduce the solution of our team "yuanjie" for the VOS Track of the 6th LSVOS Challenge at ECCV 2024. Our proposed CSS-Segment is designed to perform better on videos with complex object motion and long-term presentation, and in this report we validate its effectiveness for video object segmentation. Our method achieved a J&F score of 80.84 in the test phase and ultimately ranked 2nd in the VOS Track of the 6th LSVOS Challenge at ECCV 2024.
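For readers unfamiliar with the metric, J&F is the mean of region similarity J (the intersection over union of predicted and ground-truth masks) and contour accuracy F (an F-measure over boundary pixels). The sketch below computes both for a single binary mask pair. It is a simplification under stated assumptions: boundaries are matched exactly, pixel for pixel, whereas benchmark toolkits match boundaries within a small tolerance and average over objects and frames. All function names are hypothetical, and scores here lie in [0, 1], while the 80.84 above is on a percentage scale.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # H x W binary mask, 1 = object, 0 = background


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection over union of the two masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return 1.0 if union == 0 else inter / union  # two empty masks agree


def boundary_pixels(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels that touch background or the image border."""
    height, width = len(mask), len(mask[0])
    boundary = set()
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                outside = not (0 <= ny < height and 0 <= nx < width)
                if outside or not mask[ny][nx]:
                    boundary.add((y, x))
                    break
    return boundary


def contour_accuracy(pred: Mask, gt: Mask) -> float:
    """F: F-measure of exact boundary matches (simplified, no tolerance)."""
    pred_b, gt_b = boundary_pixels(pred), boundary_pixels(gt)
    if not pred_b and not gt_b:
        return 1.0
    if not pred_b or not gt_b:
        return 0.0
    matched = len(pred_b & gt_b)
    precision = matched / len(pred_b)
    recall = matched / len(gt_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred: Mask, gt: Mask) -> float:
    """Mean of region similarity and contour accuracy."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt))


# Toy example: the prediction covers one extra pixel.
prediction = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
score = j_and_f(prediction, ground_truth)
```

In the toy example, J is 1/2 and the simplified F is 2/3, so the combined score is their mean.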
