Learning Spatial-Semantic Features for Robust Video Object Segmentation

Xin Li, Deshui Miao, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang

TL;DR

This work tackles robust video object segmentation in long-term, challenging sequences by introducing S3, a framework that fuses semantic priors from a pre-trained ViT with spatial details via a spatial-semantic learning block and maintains discriminative target representations through a discriminative query propagation mechanism. The method combines a spatial-semantic feature generator with a dual-level memory-based target association, enabling more reliable tracking and segmentation of objects with complex parts and appearance changes. Extensive experiments on DAVIS, YoutubeVOS, MOSE, LVOS, and related benchmarks demonstrate state-of-the-art performance and strong generalization, supported by ablation studies and qualitative analyses. The approach offers practical impact for long-duration video analysis by reducing drift and improving segmentation in cluttered or deforming scenes, with code and models publicly available.

Abstract

Tracking and segmenting multiple similar objects with distinct or complex parts in long-term videos is particularly challenging due to the ambiguity in identifying target components and the confusion caused by occlusion, background clutter, and changes in appearance or environment over time. In this paper, we propose a robust video object segmentation framework that learns spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic block comprising a semantic embedding component and a spatial dependency modeling part for associating global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation to ensure effective long-term query propagation. Extensive experimental results show that the proposed method achieves state-of-the-art performance on benchmark datasets, including DAVIS 2017 test (87.8%), YoutubeVOS 2019 (88.1%), MOSE val (74.0%), and LVOS test (73.0%), demonstrating the effectiveness and generalization capacity of our model. The source code and trained models are released at https://github.com/yahooo-m/S3.
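
To make the idea of masked cross-attention more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `masked_cross_attention`, the single-head formulation, the binary mask, and the fallback for an empty mask are all assumptions made for illustration. It only shows the general mechanism by which a query can be restricted to attend to selected (for example, target) locations, so that features from other regions do not leak into the query.

```python
import math

def masked_cross_attention(queries, keys, values, mask):
    """Single-head cross-attention restricted to masked locations.

    queries: list of Q vectors (length d), one per object query
    keys, values: lists of N vectors, one per pixel or patch
    mask: Q x N nested list of 0/1; 1 means the query may attend there
    Returns a list of Q refined query vectors.
    (Hypothetical helper for illustration, not from the paper's code.)
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q, allowed in zip(queries, mask):
        # Assumption of this sketch: if the mask is empty (e.g. the target
        # is fully occluded), fall back to attending everywhere.
        if not any(allowed):
            allowed = [1] * len(keys)
        # Scaled dot-product scores; disallowed locations get -inf.
        scores = [
            scale * sum(qi * ki for qi, ki in zip(q, k)) if a else -math.inf
            for k, a in zip(keys, allowed)
        ]
        # Numerically stable softmax; masked locations receive zero weight.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The refined query is the attention-weighted sum of the values.
        updated.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return updated

# Toy example: one query, three locations, the third one masked out.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
refined = masked_cross_attention(queries, keys, values, [[1, 1, 0]])
```

In the toy example the third location has a large value vector, but because it is masked out it contributes nothing to the refined query. This is the property that helps limit noise accumulation when queries are propagated across many frames.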

Paper Structure (19 sections, 6 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Performance comparison in terms of the $\mathcal{J}\&\mathcal{F}$ score on various datasets. The proposed method achieves favorable performance against state-of-the-art methods, including AOT_L, XMem, and Cutie, on all the datasets. All methods are trained using the YoutubeVOS and DAVIS datasets.
  • Figure 2: Architecture of the proposed method. (a) shows the overall framework of the proposed method comprising a feature generation part, a target association component, and a decoder-based prediction part. The feature generation part contains multiple spatial-semantic blocks, as illustrated in (b), which learn spatial-semantic features by integrating semantic priors and spatial details. (c) illustrates the discriminative query propagation process, which learns to generate discriminative queries that represent high-level object information.
  • Figure 3: Visualized feature maps from different backbones and stages. It shows that the proposed spatial-semantic model generates effective features for target representation.
  • Figure 4: Visualized results on challenging scenarios. The proposed method performs well on these challenging sequences, while other methods suffer from limited semantic understanding and thus cannot segment whole objects. (b) Our method handles the challenging sequence with fast-moving objects better than Cutie.
  • Figure 5: Additional qualitative comparison against state-of-the-art methods including XMem, Swin-DeAOT, and Cutie. All the sequences are selected from YouTubeVOS 2019 and DAVIS 2017 datasets.
  • ...and 5 more figures
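
The $\mathcal{J}\&\mathcal{F}$ score used in Figure 1 is the standard video object segmentation metric: the mean of region similarity $\mathcal{J}$ (intersection over union of the predicted and ground-truth masks) and boundary accuracy $\mathcal{F}$. The sketch below computes $\mathcal{J}$ for one pair of binary masks and averages it with a given $\mathcal{F}$ value. The function names are hypothetical, the boundary F-measure itself is not implemented here, and scoring two empty masks as 1.0 is a convention assumed for this sketch.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and boundary accuracy."""
    return (j + f) / 2.0

# Toy example: the prediction covers the target pixel plus one extra pixel.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = region_similarity(pred, gt)  # 1 shared pixel out of 2 in the union
```

Benchmark numbers such as those in the abstract are obtained by averaging these per-object, per-frame scores over a whole dataset, which this sketch does not cover.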