Object-Centric Learning with Slot Attention

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf

TL;DR

The paper addresses learning object-centric representations from raw perceptual input to support compositional reasoning. It introduces the Slot Attention module, which maps CNN features to a set of exchangeable slots that compete over several attention iterations to explain parts of the input. In unsupervised object discovery, the authors report accuracy competitive with prior methods at lower memory and runtime cost. In supervised set prediction, they report strong results using Hungarian matching to align slots with target objects. Slots learn to bind to individual objects without direct segmentation supervision, which lets the model generalize to scenes with more objects or unseen compositions. The authors suggest extensions to video, other modalities, and downstream tasks such as visual reasoning and control.
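
The slot-to-object alignment used for set prediction can be illustrated with a small matching routine. This is a minimal sketch under stated assumptions, not the authors' code: the helper names `squared_error`, `match_slots_to_targets`, and `matched_set_loss` are hypothetical, the pairwise cost is assumed to be a squared error, and an exhaustive search over permutations stands in for the Hungarian algorithm. It reaches the same minimum cost on small sets, but at factorial rather than polynomial cost.

```python
from itertools import permutations


def squared_error(pred, target):
    """Assumed pairwise cost between one predicted vector and one target vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def match_slots_to_targets(cost):
    """Return (assignment, total_cost) minimizing the summed cost.

    cost[i][j] is the cost of pairing prediction i with target j (square matrix).
    assignment[i] is the target index paired with prediction i.
    Brute force over all permutations: a stand-in for the Hungarian algorithm.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


def matched_set_loss(preds, targets):
    """Build the pairwise cost matrix, then find the best one-to-one matching."""
    cost = [[squared_error(p, t) for t in targets] for p in preds]
    return match_slots_to_targets(cost)
```

Because the loss is computed only after matching, it does not depend on the order in which the targets are listed, which is what allows an unordered set of slots to be trained against an unordered set of objects.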

Abstract

Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
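
The competitive attention rounds described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `slot_attention_sketch` and the projection matrices `w_k`, `w_q`, `w_v` are hypothetical, and the slot update is simplified to overwriting each slot with its attention-weighted mean of the input values, with any normalization layers or learned update function omitted.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slot_attention_sketch(inputs, slots, w_k, w_q, w_v, num_iters=3, eps=1e-8):
    """Simplified iterative slot attention.

    inputs: (N, D_inputs) feature vectors, e.g. flattened CNN features.
    slots:  (K, D_slots) initial slot vectors.
    w_k: (D_inputs, d), w_q: (D_slots, d), w_v: (D_inputs, D_slots).
    Returns the updated slots (K, D_slots) and the last attention map (N, K).
    """
    d = w_q.shape[1]
    k = inputs @ w_k  # keys,   (N, d)
    v = inputs @ w_v  # values, (N, D_slots)
    for _ in range(num_iters):
        q = slots @ w_q  # queries, (K, d)
        logits = k @ q.T / np.sqrt(d)  # (N, K)
        # Softmax over the slot axis: slots compete for each input feature.
        attn = softmax(logits, axis=1)
        # Normalize over inputs so each slot takes a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        # Simplification: overwrite slots instead of applying a learned update.
        slots = weights.T @ v  # (K, D_slots)
    return slots, attn
```

Because the softmax normalizes over slots rather than over inputs, each input feature distributes one unit of attention among the slots, which is what makes them compete. In this sketch, reordering the rows of `inputs` leaves the output unchanged, and reordering the rows of `slots` reorders the output rows in the same way.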

Paper Structure

This paper contains 47 sections, 1 theorem, 7 equations, 23 figures, 10 tables, 1 algorithm.

Key Result

Proposition 1

Let $\textnormal{SlotAttention}(\textnormal{inputs}, \textnormal{slots})\in\mathbb{R}^{K\times D_{\textnormal{slots}}}$ be the output of the Slot Attention module (Algorithm 1), where $\textnormal{inputs}\in\mathbb{R}^{N\times D_{\textnormal{inputs}}}$ and $\textnormal{slots}\in\ma

Figures (23)

  • Figure 1: (a) Slot Attention module and example applications to (b) unsupervised object discovery and (c) supervised set prediction with labeled targets $y_i$. See main text for details.
  • Figure 2: (Left) Adjusted Rand Index (ARI) scores (in $\%$, mean $\pm$ stddev for 5 seeds) for unsupervised object discovery in multi-object datasets. In line with previous works (Greff et al., 2019; Burgess et al., 2019; Engelcke et al., 2019), we exclude background labels in ARI evaluation. * denotes that one outlier was excluded from evaluation. (Right) Effect of increasing the number of Slot Attention iterations $T$ at test time (for a model trained on CLEVR6 with $T=3$ and $K=7$ slots), tested on CLEVR6 ($K=7$) and CLEVR10 ($K=11$).
  • Figure 3: (a) Visualization of per-slot reconstructions and alpha masks in the unsupervised training setting (object discovery). Top rows: CLEVR6, middle rows: Multi-dSprites, bottom rows: Tetrominoes. (b) Attention masks (attn) for each iteration, only using four object slots at test time on CLEVR6. (c) Per-iteration reconstructions and reconstruction masks (from decoder). Border colors for slots correspond to colors of segmentation masks used in the combined mask visualization (third column). We visualize individual slot reconstructions multiplied with their respective alpha mask, using the visualization script from Greff et al. (2019).
  • Figure 4: Visualization of (per-slot) reconstructions and masks of a Slot Attention model trained on a greyscale version of CLEVR6, where it achieves $98.5\pm0.3\%$ ARI. Here, we show the full reconstruction of each slot (i.e., without multiplication with their respective alpha mask).
  • Figure 5: (Left) AP at different distance thresholds on CLEVR10 (with $K=10$). (Center) AP for the Slot Attention model with different numbers of iterations. The models are trained with 3 iterations and tested with iterations ranging from 3 to 7. (Right) AP for Slot Attention trained on CLEVR6 ($K=6$) and tested on scenes containing exactly $N$ objects (with $N=K$ from $6$ to $10$).
  • ...and 18 more figures
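
The ARI scores reported in the Figure 2 caption compare a predicted segmentation with the ground truth while ignoring background pixels. The sketch below shows one way such a score could be computed from flat label lists. It is an assumption-laden illustration, not the evaluation code behind the figure: the function name `adjusted_rand_index` and the `ignore_label` parameter are hypothetical, and the result is a fraction rather than a percentage.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels, ignore_label=None):
    """Adjusted Rand Index between two labelings of the same pixels.

    Pixels whose ground-truth label equals `ignore_label` (for example the
    background) are dropped before the score is computed.
    """
    pairs = [(t, p) for t, p in zip(true_labels, pred_labels) if t != ignore_label]
    n = len(pairs)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two pixels left: nothing to disagree on

    # Pair counts within contingency cells, true clusters, and predicted clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(pairs).values())
    sum_true = sum(comb(c, 2) for c in Counter(t for t, _ in pairs).values())
    sum_pred = sum(comb(c, 2) for c in Counter(p for _, p in pairs).values())

    expected = sum_true * sum_pred / total_pairs  # chance-level agreement
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)
```

The score depends only on which pixels are grouped together, not on the label values themselves, so a slot index never has to agree with a ground-truth object ID. Dropping the background means a model is not penalized for splitting the background across several slots.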

Theorems & Definitions (3)

  • Proposition 1
  • Definition 1: Permutation Invariance
  • Definition 2: Permutation Equivariance
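
For reference, the two properties named above are usually stated as follows. This is the standard formulation, given here as an assumption about the intended meaning rather than as the paper's exact wording. The symbols $f$, $x$, $\pi$, $M$, and $D$ are introduced for illustration: $x\in\mathbb{R}^{M\times D}$ is a set of $M$ elements stored as rows, and $\pi\in\mathbb{R}^{M\times M}$ is an arbitrary permutation matrix.

$$
\text{Permutation invariance:}\quad f(\pi x) = f(x)
\qquad\qquad
\text{Permutation equivariance:}\quad f(\pi x) = \pi f(x)
$$

An invariant function ignores the order of its input rows, while an equivariant function reorders its output rows in the same way as its input rows.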