Learning Spatial-Semantic Features for Robust Video Object Segmentation
Xin Li, Deshui Miao, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang
TL;DR
This work tackles robust video object segmentation in long-term, challenging sequences by introducing S3, a framework that fuses semantic priors from a pre-trained ViT with spatial details via a spatial-semantic learning block and maintains discriminative target representations through a discriminative query propagation mechanism. The method combines a spatial-semantic feature generator with a dual-level memory-based target association, enabling more reliable tracking and segmentation of objects with complex parts and appearance changes. Extensive experiments on DAVIS, YoutubeVOS, MOSE, LVOS, and related benchmarks demonstrate state-of-the-art performance and strong generalization, supported by ablation studies and qualitative analyses. The approach offers practical impact for long-duration video analysis by reducing drift and improving segmentation in cluttered or deforming scenes, with code and models publicly available.
Abstract
Tracking and segmenting multiple similar objects with distinct or complex parts in long-term videos is particularly challenging due to the ambiguity in identifying target components and the confusion caused by occlusion, background clutter, and changes in appearance or environment over time. In this paper, we propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic block comprising a semantic embedding component and a spatial dependency modeling part for associating global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation to ensure effective long-term query propagation. Extensive experimental results show that the proposed method achieves state-of-the-art performance on benchmark data sets, including the DAVIS 2017 test (87.8%), YoutubeVOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%), and demonstrate the effectiveness and generalization capacity of our model. The source code and trained models are released at https://github.com/yahooo-m/S3.
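
To make the masked cross-attention idea concrete, the following is a minimal NumPy sketch of the general technique, not the released S3 implementation. It assumes each object query attends only to the feature locations allowed by a per-query boolean mask. The function name `masked_cross_attention`, the tensor shapes, the single-head formulation without learned projections, and the fallback for an empty mask are all assumptions introduced for illustration.

```python
import numpy as np


def masked_cross_attention(queries, features, mask):
    """Single-head masked cross-attention (illustrative sketch).

    queries:  (Q, D) array, one row per object query.
    features: (N, D) array, one row per flattened spatial location.
    mask:     (Q, N) boolean array, True where a query may attend.
    Returns the updated queries (Q, D) and the attention weights (Q, N).
    """
    dim = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(dim)  # (Q, N) similarity scores

    # Assumption: a query whose mask is empty attends to every location,
    # which avoids a softmax over an empty set.
    empty = ~mask.any(axis=1, keepdims=True)
    mask = mask | empty

    # Masked-out locations get -inf so their softmax weight is exactly zero.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ features, weights


rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))
features = rng.normal(size=(6, 4))
mask = np.zeros((2, 6), dtype=bool)
mask[0, :3] = True  # query 0 sees only the first three locations
updated, weights = masked_cross_attention(queries, features, mask)
```

In this toy call, query 0 aggregates features only from its three masked locations, while query 1, whose mask is empty, falls back to attending over all six.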
