MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang

TL;DR

MVRoom addresses the challenge of controllable 3D indoor scene generation by introducing a two-stage, layout-conditioned NVS pipeline that converts a coarse 3D layout into rich multi-view conditioning signals and employs a diffusion model with layout-aware epipolar attention to ensure cross-view consistency. It adds a recursive scene-generation framework that explores camera trajectories guided by the layout and maintains a global point cloud to sustain global coherence, culminating in high-fidelity 3D-GS reconstructions. Empirical results on 3D-FRONT show significant improvements in multi-view consistency and perceptual quality, supported by ablations validating the key components. The approach enables text-to-scene generation and robust scene completion, with potential impact for AR/VR content creation and immersive environments.

Abstract

We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.
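The epipolar constraint behind the attention mechanism mentioned above can be illustrated with plain two-view geometry. The sketch below is not the paper's layout-aware module, whose details are not given here. It only shows, under assumed conventions (shared pinhole intrinsics `K`, relative pose `X_b = R @ X_a + t`), how a pixel in one view restricts the candidate pixels in another view to a band around its epipolar line. The function names and the `max_dist` threshold are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K, R, t):
    """F such that x_b^T F x_a = 0, assuming X_b = R @ X_a + t and a shared K."""
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv

def epipolar_mask(F, pixel_a, height, width, max_dist=1.5):
    """Boolean (height, width) mask of view-B pixels near the epipolar line of pixel_a.

    Assumes a non-degenerate line (its first two coefficients are not both zero).
    """
    line = F @ np.array([pixel_a[0], pixel_a[1], 1.0])  # a*u + b*v + c = 0
    us, vs = np.meshgrid(np.arange(width), np.arange(height))
    dist = np.abs(line[0] * us + line[1] * vs + line[2]) / np.hypot(line[0], line[1])
    return dist <= max_dist

# Toy setup: view B is view A shifted sideways, so epipolar lines are image rows.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
F = fundamental_matrix(K, np.eye(3), np.array([1.0, 0.0, 0.0]))
mask = epipolar_mask(F, pixel_a=(10, 20), height=64, width=64)
```

In an attention layer, a mask of this kind could be used to let a query pixel attend only to keys near its epipolar line in the other view, which reduces the number of cross-view comparisons. How MVRoom additionally uses the 3D layout in this step is not specified in this summary.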

Paper Structure

This paper contains 22 sections, 6 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: We introduce MVRoom, an indoor scene generation pipeline utilizing multi-view diffusion models. Given a 3D layout and an initial image (generated from a text description), MVRoom uses conditional layout-aware multi-view diffusion models to generate consistent novel views along continuous camera trajectories within the 3D scene. The consistent views are fed into a 3D-GS pipeline for scene reconstruction and novel-view synthesis.
  • Figure 2: MVRoom overview. We employ a two-stage generation pipeline. The first stage gathers image-based multi-view conditions: hybrid layout priors derived from the 3D scene layout $\mathcal{L}$ and camera poses $\mathbf{p}_i$, and image conditions $\mathbf{P}^i$ derived from initial or generated views. The second stage is a multi-view diffusion model that takes the image-based conditions as input and generates scene-level consistent views. The diffusion model features a layout-aware epipolar attention module that aggregates cross-view features more efficiently according to epipolar geometry and the 3D layout.
  • Figure 3: Comprehensive 3D layout conditions. Given a camera parameter $\mathbf{p}_i$, we convert the layout $\mathcal{L}$ into multiple image conditions $\mathbf{P}^i$.
  • Figure 4: Qualitative results for scene generation methods. For LucidDreamer [chung2023luciddreamer] and Text2room [hollein2023text2room], we use the same perspective view image as in our method and a text description to generate a complete 3D scene. For Set-the-scene [cohen2023set], we provide the scene layout along with a text description as input. The stable-diffusion-2-inpainting model used in the baselines is fine-tuned on our dataset for fair comparison.
  • Figure 5: Text-to-scene generation. We show the 3D-GS rendering results with the corresponding 3D layout and text input.
  • ...and 4 more figures
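The Figure 3 caption describes converting a 3D layout into per-view image conditions. As a rough illustration of that idea only, the sketch below projects axis-aligned layout boxes through a pinhole camera and paints each box's 2D extent with a class id to form a coarse semantic map. The box format `(center, size, class_id)`, the camera convention `X_cam = R @ X_world + t`, and all function names are assumptions made for this example. The paper's actual hybrid layout priors are not specified in this summary and are likely richer than a bounding-rectangle map.

```python
import itertools
import numpy as np

def box_corners(center, size):
    """Return the 8 corners, shape (8, 3), of an axis-aligned box."""
    center = np.asarray(center, dtype=float)
    half = np.asarray(size, dtype=float) / 2.0
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
    return center + signs * half

def project(points_world, K, R, t):
    """Pinhole projection. Returns pixel coordinates (N, 2) and camera depths (N,)."""
    cam = points_world @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def layout_to_semantic_map(boxes, K, R, t, height, width):
    """Paint each box's projected 2D extent with its class id, farthest box first."""
    sem = np.zeros((height, width), dtype=np.int32)  # 0 means background
    visible = []
    for center, size, class_id in boxes:
        pix, depth = project(box_corners(center, size), K, R, t)
        if np.any(depth <= 0):  # simplification: skip boxes crossing the camera plane
            continue
        visible.append((depth.mean(), pix, class_id))
    for _, pix, class_id in sorted(visible, key=lambda item: -item[0]):
        u0, v0 = np.floor(pix.min(axis=0)).astype(int)
        u1, v1 = np.ceil(pix.max(axis=0)).astype(int)
        u0, v0 = max(u0, 0), max(v0, 0)
        u1, v1 = min(u1, width - 1), min(v1, height - 1)
        if u0 <= u1 and v0 <= v1:
            sem[v0:v1 + 1, u0:u1 + 1] = class_id
    return sem

# Toy layout: a large far box (class 3), a small near box (class 5), one box behind the camera.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
boxes = [((0, 0, 4), (1, 1, 1), 3),
         ((0, 0, 2), (0.2, 0.2, 0.2), 5),
         ((0, 0, -2), (1, 1, 1), 7)]
sem = layout_to_semantic_map(boxes, K, np.eye(3), np.zeros(3), height=64, width=64)
```

Sorting by mean depth is a painter's-algorithm shortcut. A per-pixel depth buffer would handle overlapping boxes more faithfully, at the cost of a longer example.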