Object-Centric Learning with Slot Mixture Module

Daniil Kirilenko; Vitaliy Vorobyov; Alexey K. Kovalev; Aleksandr I. Panov

TL;DR

This work introduces the Slot Mixture Module (SMM), a Gaussian Mixture Model-based extension to object-centric learning that represents each slot by a mean $\mu_k$ and a diagonal covariance $\Sigma_{\mathrm{diag},k}$, and uses neural updates in place of traditional EM. By combining cluster centers with the distances between each center and its assigned vectors, and by estimating prior mixture weights $\pi_k$, SMM yields richer slot representations that improve performance on set-property prediction and image reconstruction while enabling concept sampling. Across the CLEVR, CLEVR-Mirror, ShapeStacks, ClevrTex, and COCO datasets, SMM achieves state-of-the-art results on stringent set-prediction thresholds and outperforms Slot Attention baselines on object discovery tasks. The approach advances robust, editable object-centric representations with potential benefits for reasoning, planning, and controllable scene editing in more complex, real-world data.

Abstract

Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster's center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. Our work employs a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent each slot not only by the center of its cluster but also by information about the distance between the cluster and its assigned vectors, leading to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in object-centric scenarios, achieving state-of-the-art results in the set property prediction task.
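To make the clustering view concrete, the following is a minimal NumPy sketch of one classical EM iteration for a Gaussian mixture with diagonal covariances, followed by a slot built by concatenating each cluster's mean and standard deviation. It is an illustration under stated assumptions, not the authors' implementation: SMM replaces these closed-form updates with neural updates, and the function names (`log_gaussian_diag`, `em_step`, `slots_from_gmm`), the toy data, and the `eps` stabilizer are all hypothetical.

```python
import numpy as np


def log_gaussian_diag(x, mu, var):
    """Log density of K diagonal Gaussians at N points.

    x: (N, D), mu: (K, D), var: (K, D) -> returns (N, K).
    """
    diff = x[:, None, :] - mu[None, :, :]  # (N, K, D)
    return -0.5 * np.sum(diff ** 2 / var[None] + np.log(2 * np.pi * var[None]), axis=-1)


def em_step(x, mu, var, pi, eps=1e-6):
    """One classical EM iteration (assumption: plain EM, no learned updates)."""
    # E-step: soft assignment of every vector to every cluster.
    log_p = np.log(pi)[None, :] + log_gaussian_diag(x, mu, var)  # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: re-estimate means, diagonal variances, and mixture weights.
    nk = gamma.sum(axis=0) + eps  # (K,)
    mu_new = (gamma.T @ x) / nk[:, None]
    diff = x[:, None, :] - mu_new[None, :, :]
    var_new = np.einsum("nk,nkd->kd", gamma, diff ** 2) / nk[:, None] + eps
    pi_new = nk / nk.sum()
    return mu_new, var_new, pi_new, gamma


def slots_from_gmm(mu, var):
    """Slot = concatenation of cluster center and per-dimension spread."""
    return np.concatenate([mu, np.sqrt(var)], axis=-1)  # (K, 2D)


# Toy usage: 64 feature vectors of dimension 4, clustered into 3 slots.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
K = 3
mu = x[rng.choice(64, size=K, replace=False)]
var = np.ones((K, 4))
pi = np.full(K, 1.0 / K)
for _ in range(5):
    mu, var, pi, gamma = em_step(x, mu, var, pi)
slots = slots_from_gmm(mu, var)
```

In this sketch, a mixture weight in `pi` that shrinks toward zero signals a cluster with almost no assigned vectors, which mirrors the idea of using the prior weights to identify empty slots.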

Paper Structure (19 sections, 8 equations, 9 figures, 12 tables, 1 algorithm)

This paper contains 19 sections, 8 equations, 9 figures, 12 tables, 1 algorithm.

Introduction
Background
Slot Attention
Mixture Models
Slot Mixture Module
Experiments
Image Reconstruction Using Transformer
Set Property Prediction
Object Discovery
Concept sampling
Comparing vanilla clustering
Related Works
Conclusion and Discussion
Architecture Details
Additional results for image reconstruction
...and 4 more sections

Figures (9)

Figure 1: Visualized architectures of the Slot Mixture Module (ours) and the Slot Attention module. Green is used for steps shared by both modules. SMM estimates cluster centers (${\bm{\mu}}$), the distance between cluster centers and assigned vectors ($\boldsymbol{\sigma}$, orange steps), and prior mixture weights ($\pi$, red steps). The concatenation of ${\bm{\mu}}$ and $\boldsymbol{\sigma}$ serves as the slot representation, and $\pi$ can be used to identify empty slots that do not contain information about any object. $f({\bm{x}}, {\bm{\mu}}, \boldsymbol{\sigma})$ is the log of the Gaussian PDF. The SA module estimates only the cluster centers.
Figure 2: Examples of image generation with Image GPT conditioned on different slot representations. Images with blue borders are generated by the model with the Slot Attention module, and images with green borders are generated using slots from the Slot Mixture Module. Red marks the input images.
Figure 3: All qualitatively incorrect generations from a random batch of 64 samples. In 6 cases, reconstruction using Slot Attention gave a wrong order of objects (blue circle) or lost one object (red circle). In the remaining two samples, both modules gave an incorrect reconstruction (green circle).
Figure 4: Example of object discovery on the ClevrTex dataset. The first column shows ground-truth images. The second shows broadcast-decoder reconstructions. The remaining columns show per-slot attention masks.
Figure 5: Example of editing images from ShapeStacks and Bitmoji using concept sampling. This method allows precise editing of individual concepts: the first row demonstrates changing the shape or color of a single object, or manipulating the background. The second row gives an example of obtaining similar yet slightly different hairstyles with concept sampling.
...and 4 more figures
