Object-Centric Learning with Slot Attention
Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf
TL;DR
The paper tackles learning object-centric representations from raw perceptual input to enable compositional reasoning. It introduces the Slot Attention module, which maps perceptual features (such as the output of a CNN) to a set of exchangeable slots that compete over multiple rounds of attention to explain parts of the input. In unsupervised object discovery, the authors report accuracy competitive with prior methods at lower memory and runtime cost. In supervised set prediction, where a Hungarian matching scheme aligns slots with target objects, they also report strong results. Slots learn to bind to individual objects without direct segmentation supervision, which supports generalization to scenes with more objects or unseen compositions. The authors see potential extensions to video, other modalities, and downstream tasks such as visual reasoning and control.
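To make the matching step concrete, here is a minimal sketch of aligning slot predictions with target objects by minimizing a total pairwise cost. It is an illustration, not the authors' implementation: the function name `match_slots_to_targets` and the cost values are hypothetical, and it uses a brute-force search over permutations in place of the Hungarian algorithm. Both return an optimal assignment, but brute force is only practical for a handful of slots.

```python
from itertools import permutations

def match_slots_to_targets(cost):
    """Find the slot-to-target assignment with the lowest total cost.

    cost[i][j] is the (hypothetical) loss of pairing slot prediction i
    with target object j. The matrix is assumed to be square.
    Brute force stands in for the Hungarian algorithm here.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total

# Made-up pairwise losses for 3 slots and 3 targets.
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.05],
]
assignment, total = match_slots_to_targets(cost)
# assignment[i] is the index of the target matched to slot i.
```

Because the loss is computed after matching, the order in which slots emit their predictions does not matter, which fits the exchangeable nature of the slots.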
Abstract
Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.
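The competitive attention procedure described in the abstract can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the published module: random matrices stand in for learned projections, the slot update is a plain attention-weighted mean of the values (with no learned update function or normalization layers), and the name `slot_attention_sketch` and its parameters are hypothetical. The point it shows is that the softmax is taken over the slot axis, so slots compete for each input feature.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_sketch(inputs, num_slots=4, num_iters=3, seed=0, eps=1e-8):
    """Simplified competitive attention between slots and input features.

    inputs: array of shape (num_inputs, dim), e.g. flattened CNN features.
    Returns (slots, attn) with shapes (num_slots, dim) and
    (num_inputs, num_slots).
    """
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    # Assumption: random projections stand in for learned linear maps.
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    k = inputs @ w_k
    v = inputs @ w_v
    # Slots start from random noise, so they are interchangeable.
    slots = rng.normal(size=(num_slots, d))
    for _ in range(num_iters):
        q = slots @ w_q
        logits = k @ q.T / np.sqrt(d)        # (num_inputs, num_slots)
        # Normalizing over slots makes them compete for each input.
        attn = softmax(logits, axis=1)
        # Weighted mean over inputs: each slot's weights sum to 1.
        weights = attn + eps
        weights = weights / weights.sum(axis=0, keepdims=True)
        # Simplification: replace each slot with its weighted mean.
        slots = weights.T @ v                # (num_slots, dim)
    return slots, attn

features = np.random.default_rng(1).normal(size=(16, 8))
slots, attn = slot_attention_sketch(features, num_slots=3)
```

Each row of `attn` sums to 1, so every input feature distributes its attention across the slots. With trained projections, this is the pressure that would push different slots to specialize on different parts of the input.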
