CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion

Guangyao Zhai; Evin Pınar Örnek; Shun-Cheng Wu; Yan Di; Federico Tombari; Nassir Navab; Benjamin Busam

TL;DR

This work tackles controllable 3D scene synthesis guided by scene graphs and introduces CommonScenes, a fully generative model with a Layout Branch (CVAE-based layout regression) and a Shape Branch (latent diffusion conditioned on graph relations) to produce semantically coherent and diverse indoor scenes. By evolving scene graphs into a Box-Enhanced Contextual Graph and propagating context through a triplet-GCN, the model learns a joint layout-shape distribution $Z \sim \mathcal{N}(\mu,\sigma)$ and uses cross-attention in diffusion to respect global and local relationships. The authors also create SG-FRONT, a synthetic indoor dataset providing high-quality scene-graph labels on top of 3D-FRONT, enabling robust benchmarking; experiments show that CommonScenes outperforms baselines in generation consistency, quality, and diversity. The approach promises practical impact for interactive environments in robotics, VR/AR, and content creation, with code and SG-FRONT to be released upon acceptance.
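The sampling step implied by $Z \sim \mathcal{N}(\mu,\sigma)$ can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that $\sigma$ is a per-dimension standard deviation, that each graph node has its own $(\mu, \sigma)$ pair, and that samples are drawn with the usual reparameterization $z = \mu + \sigma \cdot \epsilon$. The function names `sample_latent` and `sample_scene_latents` are hypothetical.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    Assumption: sigma holds per-dimension standard deviations.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def sample_scene_latents(node_params, seed=0):
    """Sample one latent per graph node.

    node_params maps a (hypothetical) node name to its (mu, sigma) pair.
    A fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    return {
        name: sample_latent(mu, sigma, rng)
        for name, (mu, sigma) in node_params.items()
    }


# Toy usage: two nodes with 2-dimensional latents.
latents = sample_scene_latents(
    {"bed": ([1.0, 2.0], [0.0, 0.0]), "table": ([0.0, 0.0], [1.0, 1.0])},
    seed=1,
)
```

With zero standard deviation a node's latent collapses to its mean, while a nonzero $\sigma$ yields different latents across seeds, which is one plausible source of the diversity the summary describes.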

Abstract

Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships in the scene graph while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Because no existing scene graph dataset offers high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT, where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Code and the dataset will be released upon acceptance.

Paper Structure (48 sections, 7 equations, 14 figures, 10 tables)

This paper contains 48 sections, 7 equations, 14 figures, 10 tables.

Introduction
Related Work
Scene Graph
Indoor 3D Scene Synthesis
Denoising Diffusion Models
Preliminaries
Scene Graph
Conditional Latent Diffusion Model
Method
Overview
Scene Graph Evolution
Contextual Graph
Box-Enhanced Contextual Graph
Graph Encoding
Layout Branch
...and 33 more sections

Figures (14)

Figure 1: Architecture Comparison (Upper Row): Compared with previous methods, our fully generative model requires neither databases nor multiple category-level decoders. Performance Comparison (Bottom Row): We demonstrate the effectiveness of encapsulating scene-object and object-object relationships. The semantic information from the scene graph is 'a table is surrounded by three chairs'. As highlighted in the rounded rectangles, through the scene-object relationship, our network outperforms other methods by generating a round table and three evenly distributed chairs. Through the object-object relationship, the three chairs are consistent in style. Moreover, our method still preserves the object diversity (blue dashed rectangle).
Figure 2: Scene Graph Evolution. Take the features of two nodes, Bed $(o_i)$ and Table $(o_j)$, and the linking edge In front of $(\tau_{i\to j})$ as an example, where $(o_i, o_j)$ and $\tau_{i\to j}$ are the learnable embedded node features and edge feature, respectively. We enhance the node and edge features with CLIP features $p_i, p_{i\to j}$ to obtain B. Contextual Graph. Then, we attach the parameterized ground-truth bounding box $b_i$ to each node to build C. Box-Enhanced Contextual Graph, with node and edge features represented as $f_{v_c}^{(b)i}=\{p_i, o_i, b_i\}, f_{e_c}^{i\to j}=\{p_{i\to j}, \tau_{i\to j}\}$.
Figure 3: Overview of CommonScenes. Our pipeline consists of shared modules and two collaborative branches, the Layout Branch and the Shape Branch. Given a Box-Enhanced Contextual Graph (BCG, Figure 2.C), we first feed it into $E_c$, yielding a joint layout-shape distribution $Z$. We sample $z_i$ from $Z$ for each node, obtaining the concatenated feature $\{z_i, p_i, o_i\}$ with CLIP feature $p_i$ and self-updated feature $o_i$. A graph manipulator is then optionally applied to manipulate the graph for data augmentation. Next, the updated contextual graph is fed into the layout branch and the shape branch for layout regression and shape generation, respectively. In the shape branch, we leverage $E_r$ to encapsulate global scene-object and local object-object relationships into graph nodes, which then condition $\varepsilon_\theta$ in the LDM via a cross-attention mechanism to generate ${\bf x}_0$ in $T$ steps. Finally, a frozen shape decoder (VQ-VAE) reconstructs $S'$ from ${\bf x}_0$. The final scene is generated by fitting $S'$ to the layouts.
Figure 4: Qualitative comparison. The orientations of Left/Right and Front/Behind in the scene graph align with the top-down view. Both scene-object and object-object inconsistencies are highlighted in red rectangles. Green rectangles emphasize the commonsense consistency our method produces.
Figure 5: Consistency co-exists with diversity in different rounds. Our generated objects show diversity when the model is run twice while preserving shape consistency within the scene (chairs forming a matching set).
...and 9 more figures
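The node and edge features named in the Figure 2 caption, $\{p_i, o_i, b_i\}$ and $\{p_{i\to j}, \tau_{i\to j}\}$, can be illustrated with a small sketch. It assumes the braces denote concatenation and uses toy vectors in place of real CLIP features, learned embeddings, and box parameters; the class and function names (`Node`, `Edge`, `build_triplets`) are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    clip: list   # p_i: CLIP feature of the object class (toy values here)
    embed: list  # o_i: learnable node embedding
    box: list    # b_i: bounding-box parameters (layout is an assumption)


@dataclass
class Edge:
    src: int     # index i of the subject node
    dst: int     # index j of the object node
    clip: list   # p_{i->j}: CLIP feature of the relation
    embed: list  # tau_{i->j}: learnable edge embedding


def node_feature(node):
    """Concatenate {p_i, o_i, b_i} into one box-enhanced node feature."""
    return node.clip + node.embed + node.box


def edge_feature(edge):
    """Concatenate {p_{i->j}, tau_{i->j}} into one edge feature."""
    return edge.clip + edge.embed


def build_triplets(nodes, edges):
    """Return (subject, predicate, object) feature triplets, one per edge.

    This is the input shape a triplet-style graph convolution would consume.
    """
    return [
        (node_feature(nodes[e.src]), edge_feature(e), node_feature(nodes[e.dst]))
        for e in edges
    ]


# Toy usage mirroring the Bed -> "in front of" -> Table example.
bed = Node(clip=[0.1, 0.2], embed=[1.0], box=[2.0, 1.5, 0.5])
table = Node(clip=[0.3, 0.4], embed=[2.0], box=[1.0, 1.0, 0.7])
triplets = build_triplets([bed, table], [Edge(src=0, dst=1, clip=[0.5], embed=[3.0])])
```

Keeping the box parameters as a separate field makes it easy to drop them and recover the plain contextual graph of panel B, where nodes carry only $\{p_i, o_i\}$.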
