DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

Jiapeng Tang; Yinyu Nie; Lev Markhasin; Angela Dai; Justus Thies; Matthias Nießner

TL;DR

DiffuScene presents a diffusion-based approach for generating diverse, realistic indoor scenes by modeling an unordered set of object attributes (semantics, placements, and geometry) and jointly denoising them with an attention-enhanced 1D denoiser. A geometry-feature diffusion strategy paired with shape-code retrieval enables plausible object arrangements and symmetric relations, improving inter-object coherence. The model supports scene completion, scene rearrangement, and text-conditioned scene synthesis, and demonstrates superior diversity and realism on 3D-FRONT against state-of-the-art baselines, with extensive ablations validating design choices. This work advances 3D generative modeling by combining set-based diffusion with geometry-aware retrieval and multi-attribute denoising, enabling practical applications in design, VR, and content creation.

Abstract

We present DiffuScene, a method for indoor 3D scene synthesis based on a novel scene configuration denoising diffusion model. It generates 3D instance properties stored in an unordered object set and retrieves the most similar geometry for each object configuration, which is characterized as a concatenation of attributes: location, size, orientation, semantics, and geometry features. We introduce a diffusion network that synthesizes a collection of 3D indoor objects by denoising a set of unordered object attributes. The unordered parametrization simplifies the approximation of the joint distribution. The shape feature diffusion facilitates natural object placements, including symmetries. Our method enables many downstream applications, including scene completion, scene re-arrangement, and text-conditioned scene synthesis. Experiments on the 3D-FRONT dataset show that our method synthesizes more physically plausible and diverse indoor scenes than state-of-the-art methods. Extensive ablation studies verify the effectiveness of our design choices in scene diffusion models.
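The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It assumes retrieval is a nearest-neighbor search over latent shape codes, restricted to shapes of the predicted class and using Euclidean distance; the distance metric, the `retrieve_shape` helper, and the toy database below are hypothetical and are not taken from the paper.

```python
import math


def retrieve_shape(pred_class, pred_code, database):
    """Return the id of the database shape of the predicted class whose
    latent code is closest (Euclidean distance, an assumption) to pred_code."""
    candidates = [e for e in database if e["cls"] == pred_class]
    if not candidates:
        raise ValueError(f"no shapes of class {pred_class!r} in database")
    best = min(candidates, key=lambda e: math.dist(e["code"], pred_code))
    return best["id"]


# Hypothetical toy database: ids, class labels, and 2-D codes are made up.
database = [
    {"id": "chair_a", "cls": "chair", "code": [0.0, 1.0]},
    {"id": "chair_b", "cls": "chair", "code": [1.0, 0.0]},
    {"id": "table_a", "cls": "table", "code": [0.9, 0.1]},
]

# table_a is the closest code overall, but only chairs are considered.
print(retrieve_shape("chair", [0.8, 0.2], database))  # prints "chair_b"
```

Filtering by class before comparing codes keeps a denoised code from matching geometry of the wrong category.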

Paper Structure (50 sections, 13 equations, 18 figures, 6 tables)

Introduction
Related work
Traditional Scene Modeling and Synthesis
Learning-based Generative Scene Synthesis
3D Diffusion Models
DiffuScene
Object Set Diffusion
Diffusion process.
Generative process.
Denoising network.
Training objective.
Applications
Scene completion.
Scene re-arrangement.
Text-conditioned scene synthesis.
...and 35 more sections

Figures (18)

Figure 1: We present DiffuScene, a diffusion model for diverse and realistic indoor scene synthesis. It facilitates various downstream applications: scene completion from partial scenes (left); scene re-arrangement of given objects (middle); and scene generation from a text prompt describing partial scene configurations (right).
Figure 2: Overview. Given a 3D scene $\mathcal{S}$ of $N$ objects, we represent it as an unordered set $\Vec{x}_0=\{ \Vec{o}_i \}_{i=1}^{N}$ by parametrizing each object $\Vec{o}_i$ as a vector storing all object attributes, i.e., location $\Vec{l}_i$, size $\Vec{s}_i$, orientation $\theta_i$, class label $\Vec{c}_i$, and latent shape code $\Vec{f}_i$. Based on a set of all possible $\Vec{x}_0$, we propose DiffuScene, a denoising diffusion probabilistic model for 3D scene generation. In the forward process, we gradually add noise to $\Vec{x}_0$ until we obtain standard Gaussian noise $\Vec{x}_T$. In the reverse (generative) process, a denoising network iteratively cleans the noisy scene using ancestral sampling. Finally, we use the denoised class labels and shape latent codes to perform shape retrieval, and place the object geometries according to the denoised locations, sizes, and orientations.
Figure 3: The denoising network architecture takes the attributes of multiple objects (bounding box, object class, geometry code) as input and denoises them using 1D convolutions with skip connections and attention blocks.
Figure 4: Unconditional scene synthesis. We compare our method with the state of the art by generating scenes from random noise; our results show higher diversity and better plausibility, with fewer penetration issues and more symmetric pairs.
Figure 5: (b) w/ shape diffusion captures symmetries vs. (a) w/o. The shape latent diffusion promotes symmetry discovery.
...and 13 more figures
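
The object parametrization and forward noising process described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard DDPM closed form $q(\Vec{x}_t \mid \Vec{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\Vec{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$ with a linear noise schedule, stores orientation as a single angle, and uses hypothetical helper names (`make_object`, `linear_alpha_bar`, `q_sample`), attribute sizes, and schedule values.

```python
import math
import random


def make_object(location, size, angle, class_onehot, shape_code):
    """Concatenate one object's attributes into a flat vector o_i."""
    return [*location, *size, angle, *class_onehot, *shape_code]


def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule.
    The schedule and its endpoints are assumptions; requires T >= 2."""
    alpha_bar, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
    noising every attribute of every object independently."""
    a = math.sqrt(alpha_bar[t])
    s = math.sqrt(1.0 - alpha_bar[t])
    return [[a * v + s * rng.gauss(0.0, 1.0) for v in obj] for obj in x0]


# Toy scene x_0 with N = 2 objects; all values are made up for illustration.
x0 = [
    make_object([0.5, 0.0, -1.0], [0.4, 0.5, 0.4], 0.0, [1, 0], [0.1, -0.3]),
    make_object([-0.5, 0.0, -1.0], [0.4, 0.5, 0.4], 0.0, [1, 0], [0.1, -0.3]),
]
alpha_bar = linear_alpha_bar(T=1000)
rng = random.Random(0)
x_T = q_sample(x0, 999, alpha_bar, rng)  # close to pure Gaussian noise
print(len(x_T), len(x_T[0]))  # prints "2 11"
```

Because the noise is applied per element and the scene is a set, no ordering of the objects is needed at this stage; the learned denoiser then runs the process in reverse.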
