Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities

Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius

TL;DR

This work proposes a temporal feature similarity loss as a novel way to use pre-trained self-supervised features. The loss encodes semantic and temporal correlations between image patches and naturally introduces a motion bias for object discovery.

Abstract

Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.

Paper Structure

This paper contains 50 sections, 9 equations, 20 figures, 18 tables.

Figures (20)

  • Figure 1: We propose a self-supervised temporal similarity loss for training object-centric video models. For each patch at time $t$, the model has to predict a distribution $\hat{\bm{P}}_{t, t+k}$ indicating where all semantically similar patches have moved $k$ steps into the future. The target distribution $\bm{P}_{t, t+k}$ is computed with a softmax on the affinity matrix $\bm{A}_{t, t+k}$ containing the cosine similarities between all pairs of patch features $\bm{h}_t$, $\bm{h}_{t+k}$. The loss incentivizes the model to group areas with consistent motion and semantics into slots.
  • Figure 2: Overview of VideoSAUR. Object slots $s_t$ are extracted from patch features $\bm{h}_t$ of a self-supervised ViT using time-recurrent slot attention, conditional on slots from the previous time step $t-1$. The model is trained by reconstructing the patch features $\bm{h}_t$ of the current frame $x_t$, and by predicting the similarity distribution over patches of a future frame $x_{t+k}$ (see also Figure 1). The predictions $\bm{y}_t^{\text{rec}}$ and $\bm{y}_t^{\text{sim}}$ are decoded efficiently using the SlotMixer decoder.
  • Figure 2: Loss ablation on MOVi-C.
  • Figure 3: Values of the affinity matrix $\bm{A}_{t, t+k}$ and the transition probabilities $\bm{P}_{t, t+k}$ between selected patches (marked in purple and green) of the frame $\bm{x}_t$ and all patches of the future frame $\bm{x}_{t+k}$ in MOVi-C (left) and YT-VIS (right). Red indicates maximum affinity/probability. See also \ref{fig:affinity_additional} for more examples, and https://martius-lab.github.io/videosaur/ for an interactive visualization of temporal feature similarities.
  • Figure 4: Example predictions of VideoSAUR compared to recent video object-centric methods.
  • ...and 15 more figures
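
The target construction described in the Figure 1 caption can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names, the `temperature` parameter, the toy 2-D features, and the uniform placeholder prediction are all assumptions made for illustration. It builds an affinity matrix of cosine similarities between patch features of two frames, turns each row into a target distribution with a softmax, and scores a predicted distribution with a cross-entropy.

```python
import math


def cosine_affinity(h_t, h_tk):
    """A[i][j]: cosine similarity between patch i at time t and patch j at time t+k."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    a = [normalize(v) for v in h_t]
    b = [normalize(v) for v in h_tk]
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def row_softmax(matrix, temperature=1.0):
    """Row-wise softmax; `temperature` is an assumed knob, not taken from the text above."""
    out = []
    for row in matrix:
        scaled = [x / temperature for x in row]
        m = max(scaled)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - m) for x in scaled]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out


def cross_entropy(target, predicted, eps=1e-12):
    """Mean over patches of the cross-entropy between target and predicted rows."""
    total = 0.0
    for p_row, q_row in zip(target, predicted):
        total -= sum(p * math.log(q + eps) for p, q in zip(p_row, q_row))
    return total / len(target)


# Toy features (hypothetical): 2 patches at time t, 3 patches at time t+k.
h_t = [[1.0, 0.0], [0.0, 1.0]]
h_tk = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

A = cosine_affinity(h_t, h_tk)        # affinity matrix A_{t, t+k}
P = row_softmax(A, temperature=0.1)   # target distribution P_{t, t+k}

# Placeholder prediction: a uniform distribution standing in for the model output.
P_hat = [[1.0 / len(h_tk)] * len(h_tk) for _ in h_t]
loss = cross_entropy(P, P_hat)
```

In this toy setup, each row of `P` concentrates on the future patch whose feature points in the same direction as the query patch, so a model that minimizes the cross-entropy has to track where similar content has moved. In a real model, `P_hat` would come from the decoder rather than a uniform placeholder.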