Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Shuangrui Ding, Rui Qian, Haohang Xu, Dahua Lin, Hongkai Xiong

TL;DR

This work tackles unsupervised video object segmentation by exploiting emergent objectness in DINO attention maps. A single spatio-temporal Transformer block processes frame-wise DINO features to learn robust spatio-temporal correspondences, which are then leveraged by hierarchical clustering to produce segmentation masks in a fully self-supervised, RGB-only setting. The model is trained with semantic and motion consistency losses, weighted by patch-level entropy, while the DINO backbone remains frozen, resulting in a lightweight ~1.6M-parameter temporal correlator. Empirically, the approach achieves state-of-the-art results on MOVi-E, DAVIS-17-Unsupervised, and YouTube-VIS-19, and demonstrates strong generalization and robustness across multi-object and occlusion scenarios.
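The core idea of attending jointly over patches from all frames can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes identity query/key projections (a trained block would learn them), uses a single head, and the function name `spatio_temporal_attention` and the toy features are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def spatio_temporal_attention(frames):
    """Joint self-attention over the patches of all frames.

    frames: list of T frames, each a list of N patch feature vectors
            (standing in for frozen frame-wise backbone features).
    Returns a (T*N) x (T*N) matrix whose rows sum to 1.

    ASSUMPTION: queries and keys are the raw features (no learned
    projections), so this only illustrates the data flow.
    """
    # Flatten time and space into one token sequence.
    tokens = [patch for frame in frames for patch in frame]
    scale = 1.0 / math.sqrt(len(tokens[0]))
    attn = []
    for query in tokens:
        logits = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        attn.append(softmax(logits))
    return attn


# Toy input: 2 frames, 2 patches each, 2-dimensional features.
toy_frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
]
toy_attn = spatio_temporal_attention(toy_frames)
```

In this toy input, patch 0 of the first frame attends more strongly to patch 0 of the second frame than to patch 1, which is the kind of cross-frame correspondence the attention maps are meant to capture.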

Abstract

We propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques mostly rely on auxiliary modalities or on iterative slot attention to assist object discovery, which restricts their general applicability and raises their computational cost. To address these limitations, we develop a simplified architecture that capitalizes on the emergent objectness of DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single spatio-temporal Transformer block that processes the frame-wise DINO features and establishes spatio-temporal dependencies in the form of self-attention. We then apply hierarchical clustering to these attention maps to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and particularly excels in complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will be released at https://github.com/shvdiwnkozbw/SSL-UVOS.
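To make the clustering step concrete, the following is a minimal sketch of agglomerative clustering that uses attention rows as patch descriptors. The abstract does not specify the merge criterion or stopping rule, so the cosine similarity between mean attention rows, the fixed `num_clusters` target, and the function name `cluster_by_attention` are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def mean_rows(rows):
    """Element-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cluster_by_attention(attn, num_clusters):
    """Greedy bottom-up clustering of patches by their attention rows.

    attn: square matrix, attn[i] is the attention map of patch i.
    ASSUMPTION: clusters are compared by the cosine similarity of
    their mean attention rows, and merging stops at num_clusters.
    Returns one integer label per patch.
    """
    clusters = [[i] for i in range(len(attn))]
    while len(clusters) > num_clusters:
        best = None  # (similarity, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(
                    mean_rows([attn[i] for i in clusters[a]]),
                    mean_rows([attn[i] for i in clusters[b]]),
                )
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        _, a, b = best
        # Merge the most similar pair; b > a, so deleting b keeps a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(attn)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy attention: patches 0 and 2 look at the same region, as do 1 and 3.
toy_attn = [
    [0.50, 0.00, 0.50, 0.00],
    [0.00, 0.50, 0.00, 0.50],
    [0.45, 0.05, 0.45, 0.05],
    [0.05, 0.45, 0.05, 0.45],
]
toy_labels = cluster_by_attention(toy_attn, num_clusters=2)
```

Recording the sequence of merges rather than only the final labels would expose the full hierarchy, which is what allows segmentation at different granularities.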

Paper Structure (16 sections, 9 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Attention leaks the object's position! We visualize the self-attention maps of different queries (prompts) from the video sequence (a). The frame-wise DINO attention maps (b) highlight image regions corresponding to the queried object. A randomly initialized spatio-temporal Transformer block on top of DINO produces noisy spatio-temporal attention maps (c) that coarsely track objects over time. Our method reduces the noise in the learned spatio-temporal attention maps (d), from which temporally coherent object segmentation is derived.
  • Figure 2: Visualizations of the clustering results. The left column shows the results obtained with DINO features $F$, and the right column shows the results obtained with DINO attention $A$. $F$ yields much noisier clusters, while $A$ distinguishes different classes of objects.
  • Figure 3: Overview of our BA architecture. Given the video frames and a DINO-pretrained Transformer, we first use a temporal correlator to construct the spatio-temporal correspondence. We then use the resulting attention maps as a clustering metric and apply hierarchical clustering across all frames to generate segmentation masks. During training, for each patch, we sample a positive set and a negative set and assign an importance weight based on the patch's attention map. We promote alignment within the positive set while pushing representations away from the negative set. The final loss, normalized by the importance weights, trains only the temporal correlator, while the DINO ViT remains frozen.
  • Figure 4: Qualitative comparison with DINO [caron2021emerging] and SMTC [qian2023semantics] on DAVIS-17-Unsupervised. Different colors denote different clusters.
  • Figure 5: Visualization of the clustering process. We observe interpretable clustering hierarchies that segment objects at different granularities.
  • ...and 1 more figure
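The training signal described in the Figure 3 caption (positive and negative sets per patch, with a loss normalized by importance weights) can be sketched as follows. The captions do not give the exact formulas, so everything specific here is an assumption: the weight is taken as one minus the normalized entropy of the patch's attention map (so sharply focused patches count more), the per-patch term is an InfoNCE-style contrast, and the names `importance_weight`, `patch_loss`, and `weighted_loss` are hypothetical.

```python
import math


def attention_entropy(row):
    """Shannon entropy of one attention map (a probability vector)."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


def importance_weight(row):
    """ASSUMPTION: weight = 1 - entropy / max entropy, clamped at 0.

    A uniform attention map gets weight ~0, a one-hot map gets 1.
    The source only states that weights depend on the attention map.
    """
    max_entropy = math.log(len(row))
    if max_entropy == 0.0:
        return 1.0
    return max(0.0, 1.0 - attention_entropy(row) / max_entropy)


def patch_loss(pos_sims, neg_sims, temperature=0.1):
    """InfoNCE-style term: pull positives together, push negatives apart.

    pos_sims / neg_sims: similarities between the patch and the members
    of its positive / negative set (assumed to lie in [-1, 1]).
    """
    pos = sum(math.exp(s / temperature) for s in pos_sims)
    neg = sum(math.exp(s / temperature) for s in neg_sims)
    return -math.log(pos / (pos + neg))


def weighted_loss(attn_rows, pos_sims, neg_sims, temperature=0.1):
    """Per-patch losses averaged with importance weights."""
    weights = [importance_weight(row) for row in attn_rows]
    losses = [patch_loss(p, n, temperature) for p, n in zip(pos_sims, neg_sims)]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, losses)) / total
```

In a real training loop only the temporal correlator's parameters would receive gradients from this loss, with the backbone kept frozen as the caption states; an autodiff framework would replace the plain-Python arithmetic used here.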