Slot-VAE: Object-Centric Scene Generation with Slot Attention

Yanbo Wang, Letao Liu, Justin Dauwels

TL;DR

Slot-VAE addresses unsupervised generation of coherent multi-object scenes by combining slot attention with a two-layer hierarchical VAE. It learns a global scene latent $\mathbf{z}^g$ and per-object latents $\mathbf{z}_{1:K}^{s}$, with the latter generated from $\mathbf{z}^g$ to enforce structural coherence, while aligning the slot orders between inference and generation through shared slot-attention modules. The training objective blends the ELBO with an auxiliary prior to encourage object-centric disentanglement, and a beta-weighted KL term stabilizes learning. Empirical results on ObjectsRoom, ShapeStacks, and Arrow Room show Slot-VAE delivers sharper samples and better scene-structure fidelity than baselines, along with clear object- and attribute-level disentanglement. The work enables controllable, unsupervised scene generation and editing, with potential applications in artwork, scene understanding, and data augmentation, while noting current decoder limitations and avenues for more scalable decoders in future work.
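
To make the beta-weighted KL term concrete, here is a minimal sketch of a generic hierarchical, beta-VAE-style loss. It is not the paper's exact objective: the auxiliary prior is omitted, and the diagonal-Gaussian latents, the standard-normal prior on the global latent, the KL direction, the single shared `beta`, and all function names are assumptions made for illustration. Slot terms are paired by index, which relies on the slot order being aligned between the inference and generation paths.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians given as lists of means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def beta_weighted_loss(recon_nll, global_post, slot_posts, slot_priors, beta=1.0):
    """Reconstruction term plus beta times the summed KL terms (hypothetical form).

    global_post: (mu, logvar) of the inferred global latent z^g.
    slot_posts:  K pairs (mu, logvar) inferred from the image (inference path).
    slot_priors: K pairs (mu, logvar) produced from z^g (generation path).
    """
    mu_g, logvar_g = global_post
    zeros = [0.0] * len(mu_g)
    # Assumption: the global latent is regularized toward a standard normal prior.
    kl_global = gaussian_kl(mu_g, logvar_g, zeros, zeros)
    # Slot k on the inference path is matched with slot k on the generation path.
    kl_slots = sum(
        gaussian_kl(mq, lq, mp, lp)
        for (mq, lq), (mp, lp) in zip(slot_posts, slot_priors)
    )
    return recon_nll + beta * (kl_global + kl_slots)


# Toy usage: one 1-D global latent and one 1-D slot latent.
loss = beta_weighted_loss(
    recon_nll=2.0,
    global_post=([0.0], [0.0]),
    slot_posts=[([1.0], [0.0])],
    slot_priors=[([0.0], [0.0])],
    beta=4.0,
)
```

Raising `beta` above 1 increases the pressure on the latents to match their priors at the cost of reconstruction accuracy, which is the usual trade-off in beta-weighted objectives.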

Abstract

Slot attention has shown remarkable object-centric representation learning performance in computer vision tasks without requiring any supervision. Despite its object-centric binding ability brought by compositional modelling, as a deterministic module, slot attention lacks the ability to generate novel scenes. In this paper, we propose the Slot-VAE, a generative model that integrates slot attention with the hierarchical VAE framework for object-centric structured scene generation. For each image, the model simultaneously infers a global scene representation to capture high-level scene structure and object-centric slot representations to embed individual object components. During generation, slot representations are generated from the global scene representation to ensure coherent scene structures. Our extensive evaluation of the scene generation ability indicates that Slot-VAE outperforms slot representation-based generative baselines in terms of sample quality and scene structure accuracy.

Paper Structure (13 sections, 7 equations, 12 figures, 3 tables)


Figures (12)

  • Figure 1: Slot-VAE overview. The image $\mathbf{x}$ is passed through a CNN module, and the resulting image features follow two parallel paths. On the first path, the features are input into a slot attention module to learn slot representations $\{\mathbf{s}_{k}'\}_{k=1}^K$. From the slots $\{\mathbf{s}_k'\}_{k=1}^K$, latent vectors $\{\mathbf{z}_k^{s'}\}_{k=1}^K$ are inferred. A shared decoder then decodes the individual object latent vectors $\{\mathbf{z}_k^{s'}\}_{k=1}^K$ into object masks $\boldsymbol{\pi}_{1:K}$ and object components $\mathbf{x}_{1:K}$. By combining $\mathbf{x}_{1:K}$ with $\boldsymbol{\pi}_{1:K}$, the input $\mathbf{x}$ is reconstructed. On the second path, the image features are encoded into a global latent vector $\mathbf{z}^g$. From $\mathbf{z}^g$, a feature map is built and fed into a slot attention module to generate slot representations $\{\mathbf{s}_k\}_{k=1}^K$. From $\{\mathbf{s}_k\}_{k=1}^K$, latent vectors $\{\mathbf{z}_k^{s}\}_{k=1}^K$ are inferred. The two paths use the same slot attention module, sharing weights and initialization values, and training requires $\{\mathbf{z}_k^{s'}\}_{k=1}^K$ and $\{\mathbf{z}_k^{s}\}_{k=1}^K$ to be as close as possible, as measured by KL divergence.
  • Figure 2: Image decomposition and reconstruction performance on the ObjectsRoom dataset.
  • Figure 3: Image decomposition and reconstruction performance on the ShapeStacks dataset.
  • Figure 4: Image decomposition and reconstruction performance on the Arrow Room dataset.
  • Figure 5: Datasets and generation examples of Slot-VAE and baselines.
  • ...and 7 more figures
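
The Figure 1 caption says the input is reconstructed by combining the object components $\mathbf{x}_{1:K}$ with the masks $\boldsymbol{\pi}_{1:K}$. A minimal sketch of one common way to do this is a per-pixel mixture, shown below on flattened single-channel images. The softmax normalization of mask logits across slots and the function names are assumptions for illustration; the caption does not specify how the masks are normalized.

```python
import math


def softmax_over_slots(mask_logits):
    """Normalize K per-slot mask logits at each pixel so the K masks sum to 1."""
    num_slots = len(mask_logits)
    num_pixels = len(mask_logits[0])
    masks = [[0.0] * num_pixels for _ in range(num_slots)]
    for p in range(num_pixels):
        column = [mask_logits[k][p] for k in range(num_slots)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        for k in range(num_slots):
            masks[k][p] = exps[k] / total
    return masks


def composite(components, mask_logits):
    """Mask-weighted sum of K object components, pixel by pixel."""
    masks = softmax_over_slots(mask_logits)
    num_slots = len(components)
    num_pixels = len(components[0])
    return [
        sum(masks[k][p] * components[k][p] for k in range(num_slots))
        for p in range(num_pixels)
    ]


# Toy usage: two slots, three pixels. Slot 0 is all ones, slot 1 is all zeros.
components = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
mask_logits = [[0.0, 10.0, -10.0], [0.0, -10.0, 10.0]]
image = composite(components, mask_logits)
```

In the toy call, the first pixel is shared equally between the two slots, the second is claimed almost entirely by slot 0, and the third almost entirely by slot 1.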