Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin

TL;DR

SMTC introduces a self-supervised framework that unifies high-level semantic discrimination with low-level temporal correspondence to produce object-centric representations from video. It uses a two-stage semantic-aware masked slot attention with $N$ Gaussian components ($\mu_n$, $\sigma_n$) to first decompose semantics and then distinguish instances within each semantic group, guided by temporal consistency losses and a teacher–student training scheme. Dense semantic alignment via optimal transport and instance-consistency regularization enable robust, occlusion-tolerant object discovery and state-of-the-art label propagation without motion or depth priors. The approach demonstrates that simple RGB features plus temporal cues suffice to achieve discriminative, temporally coherent object-centric representations for both single- and multi-object scenarios in video.

Abstract

Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. The code is released at https://github.com/shvdiwnkozbw/SMTC.
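The two stages described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the released implementation: the slot update is a plain attention-weighted mean (the learned projections, GRU, and MLP of standard slot attention are omitted), the semantic masks are hardened with an argmax, and the names `slot_attention`, `two_stage_slot_attention`, and `slots_per_semantic` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(feats, slots, mask=None, iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no learned layers).

    feats: (P, D) per-location features; slots: (K, D) initial slots;
    mask:  optional (P,) weights in [0, 1] restricting feature aggregation.
    Returns updated slots (K, D) and attention (K, P), normalized over slots.
    """
    d = feats.shape[1]
    for _ in range(iters):
        logits = slots @ feats.T / np.sqrt(d)          # (K, P)
        attn = softmax(logits, axis=0)                 # slots compete per location
        w = attn if mask is None else attn * mask[None, :]
        w = w / (w.sum(axis=1, keepdims=True) + eps)   # normalize over locations
        slots = w @ feats                              # attention-weighted mean
    return slots, attn

def two_stage_slot_attention(feats, mu, sigma, slots_per_semantic=2, rng=None):
    """mu, sigma: (N, D) parameters of N shared Gaussian distributions."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = mu.shape

    # Stage 1: the mean vectors initialize the slots for semantic decomposition.
    _, sem_attn = slot_attention(feats, mu)
    # Assumption: harden the soft attention into one-hot semantic masks.
    sem_masks = (np.arange(n)[:, None] == sem_attn.argmax(axis=0)[None, :])
    sem_masks = sem_masks.astype(float)                # (N, P)

    # Stage 2: per semantic, sample slots from its Gaussian and aggregate
    # features only inside that semantic's mask to separate instances.
    inst_attn = []
    for k in range(n):
        noise = rng.standard_normal((slots_per_semantic, d))
        init = mu[k] + sigma[k] * noise
        _, attn = slot_attention(feats, init, mask=sem_masks[k])
        inst_attn.append(attn * sem_masks[k][None, :])
    return sem_masks, np.stack(inst_attn)              # (N, P), (N, S, P)
```

Calling `two_stage_slot_attention(feats, mu, sigma)` with `feats` of shape `(P, D)` returns one semantic mask per Gaussian and, for each semantic, one instance attention map per sampled slot. Sharing `mu` between the two stages is what ties each group of instance slots to its semantic.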

Paper Structure

This paper contains 14 sections, 12 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: The semantic panel presents the results of query slot attention on top of the RGB feature map. It successfully decomposes different semantics, e.g., camels and fence. The correspondence panel visualizes the correspondence map after PCA dimensionality reduction, showing that different instances have different correspondence patterns. Slot attention with random sampling then coarsely distinguishes the two camels, with some redundant borders. Best viewed in color.
  • Figure 2: An overview of our framework. We first extract frame-wise features and calculate dense feature correlation, then fuse them to pass through semantic-aware masked slot attention, which comprises two slot attention stages with $N$ shared learnable Gaussian distributions. In the first semantic slot attention stage, the mean vectors serve as slot initialization to generate a set of segmentation masks for semantic decomposition. In the second masked slot attention stage, which runs on $N$ semantics in parallel, we randomly sample slots from the Gaussian distribution of each semantics and perform masked feature aggregation within the semantic area to identify distinct instances. We enforce semantic and instance temporal consistency to train the architecture in a teacher-student manner, with the teacher marked in gray.
  • Figure 3: Visualization of semantic and instance segmentation maps. The red boxes outline the ambiguous areas.
  • Figure 4: Visualization of instance alignment. The arrows point out the matched instances across time.
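
The Figure 2 caption mentions computing a dense feature correlation between frames and fusing it with the frame-wise features. One plausible reading is sketched below, under assumptions the captions do not confirm: cosine similarity with a softmax temperature for the correlation, and channel-wise concatenation for the fusion. The names `dense_correlation`, `fuse`, and `temperature` are hypothetical.

```python
import numpy as np

def dense_correlation(f_t, f_ref, temperature=0.1, eps=1e-8):
    """Correspondence map between two frames.

    f_t: (P, D) features of the current frame; f_ref: (Q, D) features of a
    reference frame. Returns a (P, Q) map whose rows sum to 1.
    """
    a = f_t / (np.linalg.norm(f_t, axis=1, keepdims=True) + eps)
    b = f_ref / (np.linalg.norm(f_ref, axis=1, keepdims=True) + eps)
    logits = a @ b.T / temperature                 # scaled cosine similarity
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def fuse(f_t, corr):
    """Assumption: fuse by concatenating features and correlation rows."""
    return np.concatenate([f_t, corr], axis=1)     # (P, D + Q)
```

Each row of the correlation map describes where one location of the current frame finds its matches in the reference frame, which is the kind of per-instance pattern the Figure 1 caption describes.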