Boosting Object Representation Learning via Motion and Object Continuity

Quentin Delfosse, Wolfgang Stammer, Thomas Rothenbacher, Dwarak Vittal, Kristian Kersting

TL;DR

This work addresses suboptimal object encodings produced by unsupervised object detectors when used for downstream tasks. It introduces Motion and Object Continuity (MOC), a model-agnostic training scheme that couples motion cues via optical flow with a temporal contrastive loss to align object representations over time. Empirically, MOC improves both object discovery and latent encodings, yielding faster convergence, higher AMI scores, and stronger downstream performance in few-shot classification and Atari gameplay across SPACE and Slot Attention baselines. The approach delivers practical benefits by enhancing object-centric representations for reasoning-driven AI, while maintaining compatibility with existing object discovery architectures. Key results are demonstrated on the Atari-OC dataset and the Atari-OCTA evaluation framework, with a focus on robustness and transfer to downstream tasks.

Abstract

Recent unsupervised multi-object detection models have shown impressive performance improvements, largely attributed to novel architectural inductive biases. Unfortunately, they may produce suboptimal object encodings for downstream tasks. To overcome this, we propose to exploit object motion and continuity, i.e., the fact that objects do not pop in and out of existence. This is accomplished through two mechanisms: (i) providing priors on the location of objects through integration of optical flow, and (ii) a contrastive object continuity loss across consecutive image frames. Rather than developing an explicit deep architecture, the resulting Motion and Object Continuity (MOC) scheme can be instantiated using any baseline object detection model. Our results show large improvements in the performance of a state-of-the-art (SOTA) model in terms of object discovery, convergence speed, and overall latent object representations, particularly for playing Atari games. Overall, we show clear benefits of integrating motion and object continuity for downstream tasks, moving beyond object representation learning based only on reconstruction.
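
To make mechanism (ii) more concrete, here is a minimal sketch of what a contrastive object continuity loss could look like. The abstract does not give the loss formula, so this uses a generic InfoNCE-style objective as a stand-in rather than the authors' definition. The function name `continuity_loss`, the `temperature` parameter, and the nearest-neighbour matching rule are all assumptions made for illustration.

```python
import numpy as np

def continuity_loss(enc_t, loc_t, enc_t1, loc_t1, temperature=0.1):
    """InfoNCE-style stand-in for an object continuity loss (assumed form).

    enc_t, enc_t1: (N, D) and (M, D) object encodings for frames t and t+1.
    loc_t, loc_t1: (N, 2) and (M, 2) object centre positions.
    """
    # L2-normalise encodings so dot products are cosine similarities.
    a = enc_t / np.linalg.norm(enc_t, axis=1, keepdims=True)
    b = enc_t1 / np.linalg.norm(enc_t1, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # shape (N, M)

    # Assumption: the positive for each object in frame t is the
    # spatially closest object in frame t+1; all others are negatives.
    dists = np.linalg.norm(loc_t[:, None, :] - loc_t1[None, :, :], axis=2)
    pos = dists.argmin(axis=1)

    # Cross-entropy of the positive against every object in frame t+1
    # (max subtracted first for numerical stability).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(pos)), pos].mean())

# Two objects that barely move between frames.
enc = np.array([[1.0, 0.0], [0.0, 1.0]])
loc_a = np.array([[0.0, 0.0], [10.0, 10.0]])
loc_b = np.array([[1.0, 0.0], [10.0, 11.0]])
print(continuity_loss(enc, loc_a, enc, loc_b))        # consistent encodings: low loss
print(continuity_loss(enc, loc_a, enc[::-1], loc_b))  # swapped encodings: high loss
```

Under these assumptions the loss is small when spatially matched objects keep similar encodings from one frame to the next, and large when their encodings are swapped.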

Paper Structure

This paper contains 43 sections, 24 equations, 15 figures, 33 tables.

Figures (15)

  • Figure 1: An object-centric reasoner playing Pong. The agent first extracts the object representations and then reasons over them to select an optimal action.
  • Figure 2: Motivational example: unsupervised object detection models are insufficient for downstream tasks such as classification, exemplified here via SPACE [SPACE2020] on Atari environments. Top: Example images of SPACE detecting objects in different Atari games. Left: F-score for object detection (blue) and few-shot classification accuracy of object encodings via ridge regression (orange, 64 objects per class; 0% accuracy corresponds to no object detected). Right: Two-dimensional t-SNE embedding of object encodings produced by SPACE for Space Invaders.
  • Figure 3: An overview of the MOC training scheme applied to a base object detection model, which provides location and object representations. In the MOC training scheme, (i) motion information (dark blue) is extracted from each frame, which allows objects to be detected and the model's latent location variables (loc) to be updated directly. (ii) Object continuity (black + cyan) aligns the encodings (enc) of spatially close objects in consecutive frames using a contrastive loss.
  • Figure 4: MOC improves object detection. Final F-scores of SPACE models and Adjusted Rand Index of Slot Attention (SLAT), both with and without MOC, over frames of different Atari-OCTA games. Training via MOC leads to substantial improvements across the investigated games. Optical flow F-scores are shown in red; they indicate the potential F-score upper bound obtainable when using motion supervision only.
  • Figure 5: MOC yields better object encodings, as indicated by the adjusted mutual information score. The adjusted mutual information of object encodings from SPACE and Slot Attention (SLAT), both with and without MOC, on Atari-OCTA is presented (mean $\pm$ std). Higher average values are better.
  • ...and 10 more figures
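
The few-shot evaluation mentioned in the Figure 2 caption (classifying object encodings via ridge regression) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `ridge_fit` and `ridge_predict`, the regularisation strength `lam`, and the synthetic two-class "encodings" are hypothetical and are not taken from the paper.

```python
import numpy as np

def ridge_fit(X, y, n_classes, lam=1.0):
    """Closed-form ridge regression onto one-hot class targets."""
    Y = np.eye(n_classes)[y]                       # (n, n_classes) one-hot targets
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    # Note: for simplicity the bias weight is regularised as well.
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)            # (d + 1, n_classes) weights

def ridge_predict(W, X):
    """Predict the class whose regression output is largest."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ W).argmax(axis=1)

# Synthetic stand-in for object encodings: two well-separated classes.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [-3.0, 0.0]])
y_train = np.repeat([0, 1], 8)                     # 8 labelled objects per class
X_train = centers[y_train] + 0.1 * rng.standard_normal((16, 2))
y_test = np.repeat([0, 1], 20)
X_test = centers[y_test] + 0.1 * rng.standard_normal((40, 2))

W = ridge_fit(X_train, y_train, n_classes=2)
accuracy = (ridge_predict(W, X_test) == y_test).mean()
print(f"few-shot accuracy: {accuracy:.2f}")
```

A probe like this only does well when encodings of the same object class are close together in latent space, which is why it serves as a proxy for encoding quality.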