Identifiable Object-Centric Representation Learning via Probabilistic Slot Attention

Avinash Kori, Francesco Locatello, Ainkaran Santhirasekaram, Francesca Toni, Ben Glocker, Fabio De Sousa Ribeiro

TL;DR

A probabilistic slot-attention algorithm is proposed that imposes an aggregate mixture prior over object-centric slot representations, thereby providing slot identifiability guarantees without supervision, up to an equivalence relation.

Abstract

Learning modular object-centric representations is crucial for systematic generalization. Existing methods show promising object-binding capabilities empirically, but theoretical identifiability guarantees remain relatively underdeveloped. Understanding when object-centric representations can theoretically be identified is crucial for scaling slot-based methods to high-dimensional images with correctness guarantees. To that end, we propose a probabilistic slot-attention algorithm that imposes an aggregate mixture prior over object-centric slot representations, thereby providing slot identifiability guarantees without supervision, up to an equivalence relation. We provide empirical verification of our theoretical identifiability result using both simple 2-dimensional data and high-resolution imaging datasets.

Paper Structure

This paper contains 32 sections, 10 theorems, 32 equations, 15 figures, 4 tables, 3 algorithms.

Key Result

Lemma 1

Given that probabilistic slot attention induces a local (per-datapoint $\mathbf{x} \in \{ \mathbf{x}_i\}_{i=1}^M$) GMM with $K$ components, the aggregate posterior $q(\mathbf{z})$ obtained by marginalizing out $\mathbf{x}$ is a non-degenerate global Gaussian mixture with $MK$ components:
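
As a sketch of what such a mixture looks like, one form consistent with the definition $q(\mathbf{z}) = \sum_{i=1}^M q(\mathbf{z} \mid \mathbf{x}_i)/M$ given in the Figure 2 caption is written out here. The per-datapoint mixing weights $\pi_{ik}$, means $\boldsymbol{\mu}_{ik}$ and variances $\boldsymbol{\sigma}_{ik}^2$ are notation assumed for this illustration and may differ from the paper's own symbols.

```latex
q(\mathbf{z})
  = \frac{1}{M} \sum_{i=1}^{M} q(\mathbf{z} \mid \mathbf{x}_i)
  = \frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K}
    \pi_{ik} \, \mathcal{N}\!\left(\mathbf{z};\, \boldsymbol{\mu}_{ik},\, \boldsymbol{\sigma}_{ik}^2\right),
\qquad
\sum_{i=1}^{M} \sum_{k=1}^{K} \frac{\pi_{ik}}{M} = 1 .
```

Under this reading, each of the $MK$ terms carries weight $\pi_{ik}/M$, and the weights sum to one because every local mixture satisfies $\sum_{k=1}^K \pi_{ik} = 1$.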

Figures (15)

  • Figure 1: Probabilistic slot attention and the identifiable aggregate slot posterior. (Left) Slot posterior GMMs per datapoint (local) and the induced aggregate posterior GMM (global). (Right) Sampling slot representations from the aggregate slot posterior is tractable.
  • Figure 2: Graphical models of probabilistic slot attention. (a) Stochastic encoder of standard slot attention (Locatello et al., 2020) with $T$ attention iterations. (b) Proposed model -- each image in the dataset $\{\mathbf{x}_i\}_{i=1}^M$ is encoded into a respective latent representation $\mathbf{z} \in \mathbb{R}^{N\times d}$, to which a (local) Gaussian mixture model with $K$ components is fit via expectation maximisation. The resulting $K$ Gaussians serve as slot posterior distributions: $\mathbf{s}_k \sim \mathcal{N}(\mathbf{s}_k; \boldsymbol{\mu}_k, \boldsymbol{\sigma}_k^2)$, for $k=1,\dots,K$. (c) Aggregate posterior distribution obtained by marginalizing out the data: $q(\mathbf{z}) = \sum_{i=1}^M q(\mathbf{z} \mid \mathbf{x}_i)/M$. We prove that $q(\mathbf{z})$ is a tractable, non-degenerate Gaussian mixture distribution which: (i) serves as the theoretically optimal prior over slots; (ii) is empirically stable across runs (i.e. identifiable up to an affine transformation and slot permutation); (iii) can be tractably sampled from and used for scene composition tasks.
  • Figure 2: Comparing slot identifiability scores (SMCC and slot-averaged $R^2$) with existing object-centric learning methods.
  • Figure 3: Aggregate Gaussian Mixture Density. Examples of aggregate posterior mixtures. For each plot, we compute the aggregate mixture (red line) based on three random bimodal Gaussian mixtures, and plot the respective densities. The three GMMs here are analogous to the local GMMs obtained from the probabilistic slot attention algorithm, and the aggregate GMM represents $q(\mathbf{z})$.
  • Figure 4: Aggregate posterior identifiability. Recovered (latent) aggregate posteriors $q(\mathbf{z})$ across 5 runs of our PSA model. As detailed in the qualitative analysis section, we used a 2D synthetic dataset with 5 total 'object' clusters, with each observation containing at most 3. This provides strong evidence of recovery of the latent space up to affine transformations, empirically verifying our identifiability claim.
  • ...and 10 more figures
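
The local-to-aggregate construction described in the Figure 2 and Figure 3 captions can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm: it uses plain EM with diagonal covariances in place of the attention-based updates, and the helper names `fit_local_gmm`, `aggregate_posterior` and `sample_aggregate`, along with the toy data, are hypothetical.

```python
import numpy as np

def fit_local_gmm(z, K, n_iter=25, seed=0):
    """Plain EM for a diagonal-covariance GMM on one datapoint's tokens z, shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = z.shape
    mu = z[rng.choice(N, size=K, replace=False)].copy()  # initialise means from tokens
    var = np.tile(z.var(axis=0) + 1e-6, (K, 1))           # shared initial variances
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] proportional to pi_k * N(z_n; mu_k, var_k)
        diff2 = (z[:, None, :] - mu[None, :, :]) ** 2     # (N, K, d)
        log_p = -0.5 * (diff2 / var + np.log(2 * np.pi * var)).sum(-1) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)         # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        Nk = r.sum(axis=0) + 1e-8
        mu = (r.T @ z) / Nk[:, None]
        diff2 = (z[:, None, :] - mu[None, :, :]) ** 2
        var = (r[:, :, None] * diff2).sum(axis=0) / Nk[:, None] + 1e-6
        pi = Nk / Nk.sum()
    return pi, mu, var

def aggregate_posterior(local_gmms):
    """Stack M local K-component GMMs into one M*K-component mixture with weights pi/M."""
    M = len(local_gmms)
    pi = np.concatenate([p / M for p, _, _ in local_gmms])
    mu = np.concatenate([m for _, m, _ in local_gmms])
    var = np.concatenate([v for _, _, v in local_gmms])
    return pi, mu, var

def sample_aggregate(pi, mu, var, n, seed=0):
    """Ancestral sampling: pick a component, then draw from its diagonal Gaussian."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(pi), size=n, p=pi)
    return mu[comp] + np.sqrt(var[comp]) * rng.standard_normal((n, mu.shape[1]))

# Toy data (assumed): M datapoints, each a set of N tokens in d dimensions with two clusters.
rng = np.random.default_rng(0)
M, N, d, K = 3, 200, 2, 2
data = [rng.normal(size=(N, d)) + rng.choice([-3.0, 3.0], size=(N, 1)) for _ in range(M)]

local = [fit_local_gmm(z, K, seed=i) for i, z in enumerate(data)]
pi, mu, var = aggregate_posterior(local)
samples = sample_aggregate(pi, mu, var, n=5)
```

Because the aggregate is itself a finite Gaussian mixture, sampling needs only a categorical draw over the $MK$ components followed by a Gaussian draw, which is one way to read the tractability claim in Figure 1.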

Theorems & Definitions (29)

  • Definition 1: Compositional Contrast
  • Definition 2: Identifiability
  • Remark 1
  • Definition 3: $\sim_{s}$-equivalence
  • Lemma 1: Aggregate Posterior Mixture
  • Theorem 1: Mixture Distribution of Concatenated Slots
  • Theorem 2: $\sim_s$-Identifiable Slot Representations
  • Corollary 3: Individual Slot Identifiability
  • Remark 2
  • Remark 3
  • ...and 19 more