Object-Centric Learning with Slot Mixture Module
Daniil Kirilenko, Vitaliy Vorobyov, Alexey K. Kovalev, Aleksandr I. Panov
TL;DR
This work introduces the Slot Mixture Module (SMM), a Gaussian Mixture Model (GMM) based extension to object-centric learning that represents each slot by a mean $μ_k$ and a diagonal covariance $Σ_{diag,k}$, and uses neural updates in place of traditional EM. By incorporating both cluster centers and the distances between clusters and their assigned vectors, together with mixture priors $π_k$, SMM yields richer slot representations that improve performance on set-property prediction and image reconstruction while enabling concept sampling. Across the CLEVR, CLEVR-Mirror, ShapeStacks, ClevrTex, and COCO datasets, SMM achieves state-of-the-art results on stringent set-prediction thresholds and outperforms Slot Attention baselines on object discovery tasks. The approach advances robust, editable object-centric representations with potential benefits for reasoning, planning, and controllable scene editing in more complex, real-world data.
Abstract
Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into a set of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where a cluster's center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. Our work employs a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only by cluster centers but also by information about the distance between clusters and their assigned vectors, leading to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in object-centric scenarios, achieving state-of-the-art results in the set property prediction task.
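To make the clustering view above concrete, here is a minimal NumPy sketch of one classical EM iteration for a diagonal-covariance Gaussian mixture over a set of feature vectors. It is an illustration under stated assumptions, not the authors' implementation: SMM is described as using neural updates in place of traditional EM, whereas this sketch uses only the closed-form updates. The function names `gmm_em_step` and `slots_from_gmm`, the random toy features, and the choice to form a slot by concatenating the mean and the diagonal variance are all hypothetical and chosen for illustration.

```python
import numpy as np


def gmm_em_step(x, mu, var, prior, eps=1e-6):
    """One classical EM iteration for a diagonal-covariance GMM.

    x:     (N, D) feature vectors
    mu:    (K, D) cluster means
    var:   (K, D) diagonal variances
    prior: (K,)   mixture weights pi_k
    """
    # E-step: log(pi_k * N(x_n | mu_k, diag(var_k))) for every pair (n, k).
    diff = x[:, None, :] - mu[None, :, :]                      # (N, K, D)
    log_prob = -0.5 * np.sum(
        diff ** 2 / var[None] + np.log(2.0 * np.pi * var[None]), axis=-1
    )                                                          # (N, K)
    log_w = np.log(prior)[None, :] + log_prob
    log_w -= log_w.max(axis=1, keepdims=True)                  # numerical stability
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)                  # responsibilities

    # M-step: closed-form re-estimation of means, variances, and priors.
    nk = gamma.sum(axis=0) + eps                               # (K,)
    mu_new = (gamma.T @ x) / nk[:, None]                       # (K, D)
    diff_new = x[:, None, :] - mu_new[None, :, :]
    var_new = np.einsum("nk,nkd->kd", gamma, diff_new ** 2) / nk[:, None] + eps
    prior_new = nk / nk.sum()
    return mu_new, var_new, prior_new, gamma


def slots_from_gmm(mu, var):
    # Assumption: a slot combines the cluster center with its diagonal variance.
    return np.concatenate([mu, var], axis=-1)                  # (K, 2D)


# Toy usage on random features (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))
K = 3
mu = x[rng.choice(64, K, replace=False)]
var = np.ones((K, 4))
prior = np.full(K, 1.0 / K)
for _ in range(5):
    mu, var, prior, gamma = gmm_em_step(x, mu, var, prior)
slots = slots_from_gmm(mu, var)
```

Compared with soft k-means, where only the centers `mu` are kept, the variance term records how far the assigned vectors spread around each center, which is the extra information the abstract refers to.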
