Grounded Object Centric Learning

Avinash Kori; Francesco Locatello; Fabio De Sousa Ribeiro; Francesca Toni; Ben Glocker

TL;DR

Grounded Object Centric Learning tackles the binding and generalization challenges of object-centric representations by introducing Grounded Slot Dictionary (GSD) and Conditional Slot Attention (CoSA). GSD grounds slots in canonical object properties and priors, enabling specialized, invariant bindings to object types, while a spectral abstraction dynamically estimates the number of objects to represent, reducing unnecessary computation. The authors formulate end-to-end variational objectives (ELBO) for object discovery and reasoning, and demonstrate benefits across scene generation, composition, and cross-domain visual reasoning with strong empirical results and ablations. This approach advances robust, reusable, and interpretable object-centric representations with improved generalization and efficiency, potentially enabling scalable unsupervised scene understanding and reasoning.

Abstract

The extraction of modular object-centric representations for downstream tasks is an emerging area of research. Learning grounded representations of objects that are guaranteed to be stable and invariant promises robust performance across different tasks and environments. Slot Attention (SA) learns object-centric representations by assigning objects to slots, but presupposes a single distribution from which all slots are randomly initialised. This results in an inability to learn specialized slots which bind to specific object types and remain invariant to identity-preserving changes in object appearance. To address this, we present Conditional Slot Attention (CoSA) using a novel concept of Grounded Slot Dictionary (GSD) inspired by vector quantization. Our proposed GSD comprises (i) canonical object-level property vectors and (ii) parametric Gaussian distributions, which define a prior over the slots. We demonstrate the benefits of our method in multiple downstream tasks such as scene generation, composition, and task adaptation, whilst remaining competitive with SA in popular object discovery benchmarks.
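The two GSD components named in the abstract (canonical property vectors plus per-entry Gaussians) suggest a lookup-then-sample step for slot initialisation. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the function name `init_slots_from_dictionary`, the array shapes, and the use of a hard cosine-similarity argmax for the lookup are all assumptions made for illustration.

```python
import numpy as np

def init_slots_from_dictionary(object_feats, keys, mu, sigma, rng):
    """Hypothetical sketch: match each object feature to a dictionary entry,
    then sample an initial slot from that entry's Gaussian.

    object_feats: (K, D)   one feature vector per object in the scene
    keys:         (M, D)   canonical object-level property vectors
    mu, sigma:    (M, D_s) mean and standard deviation of each entry's Gaussian
    """
    # Cosine similarity between every object feature and every dictionary key.
    f = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = f @ k.T                      # shape (K, M)

    # Hard assignment (assumption): each object picks its most similar entry.
    idx = sim.argmax(axis=1)           # shape (K,)

    # Reparameterised sample s0 = mu + sigma * eps from the selected Gaussians.
    eps = rng.standard_normal(mu[idx].shape)
    slots = mu[idx] + sigma[idx] * eps
    return idx, slots

# Toy usage: 4 dictionary entries, 2 objects in the scene.
rng = np.random.default_rng(0)
keys = np.eye(4)
mu = np.arange(4, dtype=float)[:, None] * 10.0 * np.ones((4, 3))
sigma = 0.01 * np.ones((4, 3))
object_feats = np.array([[0.9, 0.1, 0.0, 0.0],
                         [0.0, 0.0, 0.2, 1.0]])
idx, slots = init_slots_from_dictionary(object_feats, keys, mu, sigma, rng)
```

In this toy setup the two objects select entries 0 and 3, so objects of the same type would start from the same prior regardless of where they appear, in contrast to sampling every slot from a single shared distribution as SA does.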

Paper Structure (51 sections, 4 theorems, 34 equations, 30 figures, 21 tables, 2 algorithms)

Introduction
Related Work
Background
Unsupervised Conditional Slot Attention: Formalism
Experiments
Case study 1: Object discovery & composition
Case study 2: Visual Reasoning & Generalizability
Conclusion
Broader Impact
Acknowledgements
Appendix
Assumptions
Proofs
Proposition 1: (Object Discovery - ELBO formulation):
Proposition 2: (ELBO formulation for reasoning task):
...and 36 more sections

Key Result

Proposition 1

Under a categorical distribution over our discrete latent variables $\tilde{\mathbf{z}}$, and the object-level prior distributions $p(\mathbf{s}^0_i) = \mathcal{N}\left(\mathbf{s}^0_i; \boldsymbol{\mu}_i, \boldsymbol{\sigma}^2_i \right)$ contained in $\mathfrak{S}^2$, we show that the variational lower bound ... where $\mathbf{s} \coloneqq \prod_{t=1}^T \mathcal{H}_{\theta}\left({\mathbf{s}}^{t-1} \mid f(g(\ha

Figures (30)

Figure 1: The leftmost block illustrates various scenes within an environment, each featuring different object instances. In the middle block, we depict our acquired grounded vocabulary of canonical object-centric representations, effectively capturing object types. The rightmost block displays a collection of specialized slot distributions associated with their respective canonical representations. These distributions are employed to sample initial slots for object instances within a scene. This process, known as object binding, is illustrated by the placeholder slots $s_1$ and $s_2$. These slots are linked to specific object types in the environment and undergo further refinement. Notably, this differs from SA, which relies on a single distribution for random slot initialization and does not encourage slots to remain invariant in the face of identity-preserving changes in object appearance.
Figure 2: CoSA is an unsupervised autoencoder framework for grounded object-centric representation learning, and it is composed of five unique sub-modules. The abstraction module extracts all the distinct objects in a scene using spectral decomposition. The grounded slot dictionary (GSD) module maps the object representation to grounded (canonical) slot representations, which are then used for sampling initial slot conditions. The refinement module uses slot attention to iteratively refine the initial slot representations. The discovery module maps the slot representations to observational space (used for object discovery and visual scene composition). The reasoning module involves object property transformation and the prediction model (used for reasoning tasks).
Figure 3: GSD binding: cheeks are bound to $\mathfrak{S}^1_7$, the forehead to $\mathfrak{S}^1_{14}$, eyes to $\mathfrak{S}^1_{25}$, and facial hair to $\mathfrak{S}^1_{55}$, illustrating the object binding achieved by GSD on the Bitmoji dataset for a CoSA model trained with the cosine sampling strategy.
Figure 4: Object discovery: reconstruction quality and dynamic slot number selection for CoSA-Cosine on CLEVR and Bitmoji, with an MAE of 2.06 for slot number estimation on CLEVR.
Figure 5: Top and bottom left illustrate the randomly prompted slots and their composition. Right demonstrates object discovery results of CoSA on the COCO dataset.
...and 25 more figures
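
The Figure 2 caption says the abstraction module uses spectral decomposition, and the TL;DR says it dynamically estimates how many objects to represent. This page does not state the selection criterion, so the sketch below substitutes a generic one: eigendecompose the feature covariance and count the leading eigenvalues needed to reach an explained-variance threshold. The function name `estimate_num_slots`, the `var_threshold` and `max_slots` parameters, and the threshold rule itself are assumptions, not the paper's procedure.

```python
import numpy as np

def estimate_num_slots(features, var_threshold=0.95, max_slots=10):
    """Hypothetical sketch: estimate a slot count from the feature spectrum.

    features: (N, D) array of per-location feature vectors for one scene.
    Returns the number of leading covariance eigenvalues needed to explain
    `var_threshold` of the total variance, capped at `max_slots`.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(features) - 1, 1)

    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative values

    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, var_threshold) + 1)
    return min(k, max_slots)

# Toy usage: 8-dimensional features with 3 dominant directions plus small noise.
rng = np.random.default_rng(0)
toy = 0.01 * rng.standard_normal((500, 8))
toy[:, :3] += rng.standard_normal((500, 3))
num_slots = estimate_num_slots(toy)
```

On this toy input three directions carry almost all of the variance, so the estimate is 3. Allocating only as many slots as the spectrum supports is what lets a model avoid refining slots that would bind to nothing.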

Theorems & Definitions (14)

Definition 1
Proposition 1: ELBO for Object Discovery
Proposition 2: ELBO for Reasoning Tasks
Remark 1
Remark 2
Remark 3
Remark 4
proof
proof
Proposition 3
...and 4 more
