SPATIALGEN: Layout-guided 3D Indoor Scene Generation

Chuan Fang, Heng Li, Yixun Liang, Jia Zheng, Yongsen Mao, Yuan Liu, Rui Tang, Zihan Zhou, Ping Tan

TL;DR

SpatialGen tackles the challenge of high-fidelity 3D indoor scene generation by introducing a large-scale synthetic dataset and a layout-guided multi-view diffusion framework. It converts a 3D semantic layout into per-view semantic and geometry representations, then jointly synthesizes RGB, semantics, and scene coordinates via a cross-view, cross-modal attention mechanism, followed by iterative view expansion and Gaussian splatting for free-view rendering. The approach yields superior realism and semantic-geometry consistency compared to score-distillation and panorama-proxy baselines, and benefits from the expansive SpatialGen dataset. This work advances practical layout-conditioned 3D scene synthesis for design, AR/VR, and embodied AI, while openly releasing data and models to the community.

Abstract

Creating high-fidelity 3D models of indoor environments is essential for applications in design, virtual reality, and robotics. However, manual 3D modeling remains time-consuming and labor-intensive. While recent advances in generative AI have enabled automated scene synthesis, existing methods often struggle to balance visual quality, diversity, semantic consistency, and user control. A major bottleneck is the lack of a large-scale, high-quality dataset tailored to this task. To address this gap, we introduce a comprehensive synthetic dataset featuring 12,328 structured annotated scenes with 57,431 rooms and 4.7M photorealistic 2D renderings. Leveraging this dataset, we present SpatialGen, a novel multi-view multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image (derived from a text prompt), our model synthesizes appearance (color image), geometry (scene coordinate map), and semantics (semantic segmentation map) from arbitrary viewpoints, while preserving spatial consistency across modalities. In our experiments, SpatialGen consistently outperforms previous methods. We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.
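To make the geometry modality concrete: a scene coordinate map stores, for every pixel, the 3D point it observes in a shared scene frame. The sketch below is a minimal illustration, not the authors' code. It assumes a pinhole camera, z-depth input, and a camera-to-world pose; the function name `scene_coordinate_map` and its parameters are hypothetical.

```python
import numpy as np

def scene_coordinate_map(depth, K, cam_to_world):
    """Back-project a depth map into per-pixel scene-frame XYZ coordinates.

    Assumptions (not taken from the paper): pinhole intrinsics, depth measured
    along the optical axis, pixel centres at half-integer coordinates.

    depth:        (H, W) z-depth
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) rigid camera-to-world transform
    returns:      (H, W, 3) scene coordinate map
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3)
    # Camera-frame rays with unit z, then scaled by the depth of each pixel.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth[..., None]
    # Rotate and translate into the shared scene frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t
```

Because every view writes into the same scene frame, two pixels from different views that observe the same surface point should carry the same XYZ value, which is what makes this representation convenient for checking cross-view consistency.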

Paper Structure

This paper contains 48 sections, 4 equations, 23 figures, 5 tables, 1 algorithm.

Figures (23)

  • Figure 1: Given a 3D semantic layout, SpatialGen can generate a 3D indoor scene conditioned on either a textual description (left) or a reference image (middle). It can also transform a real-world scene, whose 3D layout is estimated from a video by the layout estimator SpatialLM, into entirely new scenes.
  • Figure 2: Illustration of our dataset. For each scene, we provide comprehensive panoramic renderings and 3D layout annotation.
  • Figure 3: Overall pipeline. SpatialGen takes a 3D semantic layout and one or more posed images as input and creates a 3D scene. First, a layout-guided multi-view multi-modal diffusion model generates per-view RGB images, scene coordinate maps, and semantic segmentation maps. Then, an iterative dense view generation strategy produces images at additional sampled viewpoints. Finally, these images are fed into a 3D reconstruction method to produce the final result.
  • Figure 4: Multi-view and multi-modal alternating attention. It alternates between enforcing multi-view consistency and multi-modal fidelity within a unified attention mechanism.
  • Figure 5: Comparison of reconstruction results for scene coordinate maps. The image VAE (a) produces noisy results, and the SCM-VAE without gradient loss (b) produces distorted results. Our SCM-VAE (c) accurately reconstructs the scene geometry.
  • ...and 18 more figures
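
The alternating attention described in the Figure 4 caption can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: tokens are laid out as (views, modalities, tokens, channels), attention is single-head with no learned projections, and the function names are hypothetical. The cross-view step lets tokens of one modality attend across all views; the cross-modal step lets tokens of one view attend across all modalities.

```python
import numpy as np

def attention(x):
    """Single-head self-attention without learned projections. x: (B, L, C)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def cross_view_attention(x):
    """Each modality attends across all views. x: (V, M, N, C)."""
    v, m, n, c = x.shape
    # Batch over modalities; the sequence is every token from every view.
    seq = x.transpose(1, 0, 2, 3).reshape(m, v * n, c)
    out = attention(seq).reshape(m, v, n, c)
    return out.transpose(1, 0, 2, 3)

def cross_modal_attention(x):
    """Each view attends across all modalities. x: (V, M, N, C)."""
    v, m, n, c = x.shape
    # Batch over views; the sequence is every token from every modality.
    seq = x.reshape(v, m * n, c)
    return attention(seq).reshape(v, m, n, c)

def alternating_attention(x, num_blocks=2):
    """Alternate the two attention patterns with residual connections."""
    for _ in range(num_blocks):
        x = x + cross_view_attention(x)
        x = x + cross_modal_attention(x)
    return x
```

With three modalities (RGB, semantics, scene coordinates), an input of shape `(V, 3, N, C)` keeps its shape through `alternating_attention`. Each step attends over a sequence of `V * N` or `3 * N` tokens instead of the full `V * 3 * N`, which is one plausible motivation for alternating rather than attending over everything at once.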