Temporally Consistent Object-Centric Learning by Contrasting Slots

Anna Manasyan, Maximilian Seitzer, Filip Radovic, Georg Martius, Andrii Zadaianchuk

TL;DR

This work tackles unsupervised object-centric learning from videos, where maintaining temporally consistent object slots is crucial for downstream tasks. It introduces Slot Contrast, a temporal contrastive loss operating on slot representations across consecutive frames and across a batch, combined with learned slot initialization and adapted DINOv2 features to enforce coherence and improve object discovery. The approach yields state-of-the-art temporal consistency and object discovery on MOVi-C, MOVi-E, and YouTube-VIS, and supports unsupervised object dynamics prediction via SlotFormer. It also demonstrates robustness to occlusions and results in sparser, more faithful slot allocations, highlighting the practical potential for autonomous control and video understanding in real-world data.

Abstract

Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly-supervised methods that leverage motion masks as additional cues.

Paper Structure

This paper contains 48 sections, 8 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Slot Contrast model architecture overview. For each frame, we extract patch features $h_t$ using DINOv2 ViT. These features are then used to update the previously initialized or predicted slots, resulting in new slots $S_t$. The model is trained by contrasting the current frame's slots $S_t$ with the slots from the previous frame $S_{t-1}$, and by reconstructing the patch features $h_t$.
  • Figure 2: Overview of the losses used in Slot Contrast. (a) Our proposed temporal consistency objective, slot-slot contrastive loss, operates on a batch of video sequences by enforcing temporal alignment across object slots. For each frame in the sequence, the model groups object features into specific slot representations ${S_{t}^i}$. The slot-slot contrastive loss then enforces temporal consistency by drawing the corresponding slot representations from adjacent frames closer, while simultaneously pushing apart all other slot representations in the batch, whether they come from different objects within the same video or from objects in other videos. (b) The feature reconstruction loss ensures informativeness of the learned slots by using them to reconstruct original DINOv2 features with an MLP decoder.
  • Figure 3: Qualitative comparison with VideoSAURv2 on the YouTube-VIS dataset. In challenging situations (e.g., almost full occlusions at $t=24$ of the 1st video and $t=14$ of the 2nd video), VideoSAURv2 reassigns slots to different objects (pink arrows), whereas Slot Contrast consistently assigns slots to the same object (green arrows). Note that the colors of the masks are matched manually for better visual comparison.
  • Figure 4: Comparison of the Feature Reconstruction (Feat. Rec.) baseline, the slot-slot contrastive loss using only slots from the same video as the contrast set (Intra-video Contrast), and Slot Contrast on the MOVi-C dataset.
  • Figure 5: Object dynamics prediction task on MOVi-C, with SlotFormer (Wu et al., 2023) operating on Slot Contrast slots.
  • ...and 12 more figures
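
The slot-slot contrastive loss described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes an InfoNCE-style cross-entropy over cosine similarities, a hypothetical temperature of 0.1, and a candidate set made up only of the previous frame's slots across the batch. The function name slot_slot_contrastive_loss and the nested-list layout slots[b][t][k] are likewise illustrative choices. The positive for slot k of video b at frame t is the same slot of the same video at frame t-1; every other previous-frame slot in the batch acts as a negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def slot_slot_contrastive_loss(slots, temperature=0.1):
    """InfoNCE-style temporal loss over slots (illustrative sketch).

    slots[b][t][k] is a list of D floats: video b, frame t, slot k.
    The temperature value is an assumption, not taken from the paper.
    """
    num_videos = len(slots)
    num_frames = len(slots[0])
    num_slots = len(slots[0][0])
    if num_frames < 2:
        raise ValueError("need at least two frames to contrast")

    total, count = 0.0, 0
    for t in range(1, num_frames):
        # Flatten the slots of frames t and t-1 across the whole batch,
        # so index b * num_slots + k refers to the same slot in both lists.
        current = [l2_normalize(slots[b][t][k])
                   for b in range(num_videos) for k in range(num_slots)]
        previous = [l2_normalize(slots[b][t - 1][k])
                    for b in range(num_videos) for k in range(num_slots)]
        for i, anchor in enumerate(current):
            # Cosine similarity to every previous-frame slot in the batch.
            logits = [sum(a * p for a, p in zip(anchor, cand)) / temperature
                      for cand in previous]
            # Numerically stable log-sum-exp for the softmax denominator.
            top = max(logits)
            log_z = top + math.log(sum(math.exp(l - top) for l in logits))
            # Cross-entropy with the same slot index as the positive.
            total += log_z - logits[i]
            count += 1
    return total / count


if __name__ == "__main__":
    e = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    # Two videos, three frames, two slots each; slots keep their identity.
    consistent = [[[e[0], e[1]] for _ in range(3)],
                  [[e[2], e[3]] for _ in range(3)]]
    # Same data, but the first video swaps its two slots at the middle frame.
    swapped = [[[e[0], e[1]], [e[1], e[0]], [e[0], e[1]]],
               [[e[2], e[3]] for _ in range(3)]]
    print("consistent:", slot_slot_contrastive_loss(consistent))
    print("swapped:   ", slot_slot_contrastive_loss(swapped))
```

In this toy setup the loss is lower when each slot keeps tracking the same vector across frames than when two slots swap roles, which is the behavior the caption attributes to the objective. In the full model this term is combined with the feature reconstruction loss from panel (b); how the two are weighted is not stated in this summary.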