Slot-VAE: Object-Centric Scene Generation with Slot Attention
Yanbo Wang, Letao Liu, Justin Dauwels
TL;DR
Slot-VAE addresses unsupervised generation of coherent multi-object scenes by combining slot attention with a two-layer hierarchical VAE. It learns a global scene latent $\mathbf{z}^g$ and per-object slot latents $\mathbf{z}_{1:K}^{s}$; the slot latents are generated from $\mathbf{z}^g$ to enforce structural coherence, and shared slot-attention modules align the slot order between inference and generation. The training objective combines the ELBO with an auxiliary prior that encourages object-centric disentanglement, and a beta-weighted KL term stabilizes learning. On ObjectRoom, ShapeStacks, and Arrow Room, Slot-VAE produces sharper samples and better scene-structure fidelity than baselines, along with clear object- and attribute-level disentanglement. The work enables controllable, unsupervised scene generation and editing, with potential applications in artwork, scene understanding, and data augmentation. The authors note limitations of the current decoder and point to more scalable decoders as future work.
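The two-level latent structure and the beta-weighted KL term described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear maps `w_mu` and `w_logvar` are hypothetical stand-ins for learned networks, the latent sizes are arbitrary, the same beta is assumed to weight both KL terms, and the auxiliary prior and the decoder are omitted.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over all elements."""
    log_ratio = logvar_q - logvar_p
    sq_diff = (mu_q - mu_p) ** 2 / np.exp(logvar_p)
    return 0.5 * np.sum(np.exp(log_ratio) + sq_diff - 1.0 - log_ratio)

def sample_latents(rng, w_mu, w_logvar):
    """Ancestral sampling: draw z_g first, then K slot latents conditioned on z_g."""
    num_slots, slot_dim, global_dim = w_mu.shape
    z_g = rng.standard_normal(global_dim)        # z^g ~ N(0, I)
    # Hypothetical conditional prior p(z^s_k | z^g): linear maps in place of learned networks.
    mu_s = w_mu @ z_g                            # shape (K, slot_dim)
    logvar_s = w_logvar @ z_g                    # shape (K, slot_dim)
    eps = rng.standard_normal((num_slots, slot_dim))
    z_s = mu_s + np.exp(0.5 * logvar_s) * eps    # reparameterized sample
    return z_g, z_s

def beta_weighted_loss(recon_nll, kl_global, kl_slots, beta=1.0):
    """Negative ELBO with one beta weighting both KL terms (an assumption of this sketch)."""
    return recon_nll + beta * (kl_global + kl_slots)

rng = np.random.default_rng(0)
w_mu = 0.1 * rng.standard_normal((4, 6, 8))      # K=4 slots, slot_dim=6, global_dim=8
w_logvar = 0.1 * rng.standard_normal((4, 6, 8))
z_g, z_s = sample_latents(rng, w_mu, w_logvar)
```

Because every slot latent is drawn from a distribution parameterized by the same `z_g`, the slots share scene-level information, which is the mechanism the summary credits for structural coherence.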
Abstract
Slot attention has shown remarkable object-centric representation learning performance in computer vision tasks without requiring any supervision. Although compositional modelling gives slot attention its object-centric binding ability, it is a deterministic module and therefore cannot generate novel scenes. In this paper, we propose Slot-VAE, a generative model that integrates slot attention with the hierarchical VAE framework for object-centric structured scene generation. For each image, the model simultaneously infers a global scene representation that captures high-level scene structure and object-centric slot representations that embed individual object components. During generation, slot representations are generated from the global scene representation to ensure coherent scene structures. Our extensive evaluation of scene generation ability indicates that Slot-VAE outperforms slot representation-based generative baselines in terms of sample quality and scene structure accuracy.
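For readers unfamiliar with the slot attention module mentioned above, the following is a simplified NumPy sketch of its iterative update. It keeps only the core idea (softmax over slots so that slots compete for input features, followed by a weighted mean) and, as an assumption made for brevity, leaves out the learned query/key/value projections, layer normalization, GRU, and MLP of the published module. The function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, slots, num_iters=3, eps=1e-8):
    """Simplified slot attention.

    inputs: (N, D) feature vectors, slots: (K, D) initial slots.
    Returns the updated slots (K, D) and the final attention map (N, K).
    """
    dim = inputs.shape[1]
    for _ in range(num_iters):
        logits = inputs @ slots.T / np.sqrt(dim)      # (N, K) dot-product similarity
        attn = softmax(logits, axis=1)                # normalize over slots: slots compete
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)  # per-slot weights over inputs
        slots = weights.T @ inputs                    # (K, D) weighted mean of inputs
    return slots, attn

rng = np.random.default_rng(0)
inputs = rng.standard_normal((16, 4))   # e.g. 16 image locations with 4 features each
slots0 = rng.standard_normal((3, 4))    # 3 initial slots
slots, attn = slot_attention(inputs, slots0)
```

Given fixed inputs and initial slots, the update is a deterministic function with no latent variable to sample from, which illustrates why a probabilistic model is needed on top of it to generate new scenes.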
