CSS-Segment: 2nd Place Report of LSVOS Challenge VOS Track

Jinming Chai; Qin Ma; Junpei Zhang; Licheng Jiao; Fang Liu

TL;DR

CSS-Segment targets video object segmentation in long-term LVOS scenarios by fusing strengths from Cutie, SAM, and SAM2 into a streaming, memory-aware architecture. It combines a streaming Image Encoder, a SAM-inspired Mask Encoder, an Object Transformer with an Object Memory, and memory-based reading to maintain object-centric segmentation across long sequences. The approach emphasizes cross-module memory reading, multi-scale inference, and fusion strategies to boost robustness against motion, occlusion, and reappearance. Empirically, CSS-Segment achieves a J&F score of 80.84 and ranks 2nd in the ECCV 2024 LSVOS VOS Track, highlighting the effectiveness of object-level memory and cross-module integration for challenging video segmentation tasks.

Abstract

Video object segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. In this technical report, we briefly introduce the solution of our team "yuanjie" for video object segmentation in the 6th LSVOS Challenge VOS Track at ECCV 2024. We expect the proposed CSS-Segment to perform better on videos with complex object motion and long-term presence of objects. In this report, we validate the effectiveness of CSS-Segment for video object segmentation. Our method achieved a J&F score of 80.84 in the test phase and ultimately ranked 2nd in the 6th LSVOS Challenge VOS Track at ECCV 2024.

Paper Structure (12 sections, 1 figure)

Introduction
Method
Image Encoder
Mask Encoder
Object Transformer
Object Memory
Experiment
Dataset
Fine-tune
Inference
Multi-level Fusion
Conclusion

Figures (1)

Figure 1: Workflow of CSS-Segment. The image encoder operates in a streaming fashion, consuming video frames as they become available. The mask encoder embeds masks using convolutions, and its output is summed element-wise with the image embedding. We store pixel memory and object memory representations from past segmented (memory) frames. Pixel memory is retrieved for the query frame as pixel readout, which bidirectionally interacts with object queries and object memory in the object transformer. The object transformer blocks enrich the pixel feature with object-level semantics and produce the final object readout for decoding into the output mask.
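
The two data-flow steps named in the caption, the element-wise sum of mask and image embeddings and the retrieval of pixel memory as a pixel readout, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_mask_embedding` and `pixel_readout` are hypothetical, features are plain Python lists rather than tensors, and the readout uses simple dot-product softmax attention as a stand-in for whatever similarity function the actual model uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_mask_embedding(image_embedding, mask_embedding):
    """Sum the mask embedding element-wise with the image embedding.

    Both inputs are lists of per-pixel feature vectors of the same shape
    (assumption: the mask encoder already produced matching dimensions).
    """
    return [
        [a + b for a, b in zip(img_px, mask_px)]
        for img_px, mask_px in zip(image_embedding, mask_embedding)
    ]


def pixel_readout(query_keys, memory_keys, memory_values):
    """Read pixel memory for each query pixel.

    Assumption: dot-product similarity followed by softmax, so each query
    pixel receives a weighted average of the stored memory values.
    """
    dim = len(memory_values[0])
    readout = []
    for q in query_keys:
        # Similarity between this query pixel and every memory pixel.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in memory_keys]
        weights = softmax(scores)
        # Weighted average of memory values, one output channel at a time.
        readout.append(
            [sum(w * v[d] for w, v in zip(weights, memory_values)) for d in range(dim)]
        )
    return readout
```

In this sketch, a query pixel whose key closely matches one stored memory key reads back almost exactly that memory pixel's value. In the pipeline described by the caption, such a readout would then be passed to the object transformer for refinement.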
