RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture

Liangchen Song, Liangliang Cao, Hongyu Xu, Kai Kang, Feng Tang, Junsong Yuan, Yang Zhao

TL;DR

This work tackles text-driven editing of real indoor 3D scenes by refining both geometry and texture of a scanned mesh. It introduces Geometry Guided Diffusion to produce a coherent cubemap texture conditioned on text and depth priors, followed by Mesh Optimization that jointly updates texture and geometry via differentiable rendering and pseudo-depth supervision. A distance-map blending strategy ensures cross-face consistency, and extensive experiments on ARKitScenes demonstrate improved texture quality, geometry smoothness, and global style coherence compared with NeRF-based and image-guided baselines. The approach enables robust, style-controlled editing of real-world interiors with practical applicability on consumer-scanned data.

Abstract

The techniques for 3D indoor scene capturing are widely used, but the meshes produced leave much to be desired. In this paper, we propose "RoomDreamer", which leverages powerful natural language to synthesize a new room with a different style. Unlike existing image synthesis methods, our work addresses the challenge of synthesizing both geometry and texture aligned to the input scene structure and prompt simultaneously. The key insight is that a scene should be treated as a whole, taking into account both scene texture and geometry. The proposed framework consists of two significant components: Geometry Guided Diffusion and Mesh Optimization. Geometry Guided Diffusion for 3D Scene guarantees the consistency of the scene style by applying the 2D prior to the entire scene simultaneously. Mesh Optimization improves the geometry and texture jointly and eliminates the artifacts in the scanned scene. To validate the proposed method, real indoor scenes scanned with smartphones are used for extensive experiments, through which the effectiveness of our method is demonstrated.

Paper Structure (11 sections, 8 equations, 9 figures, 2 tables, 1 algorithm)

Figures (9)

  • Figure 1: Our method aims to jointly improve the geometry and generate the texture of an input indoor mesh. The upper figure shows the true room with a panoramic view and a depth map. Then, given a text prompt (in the middle), our model can synthesize new rooms with different styles (in the bottom rows). Note that the input mesh is often of low quality, and our method can polish both the texture and the geometry.
  • Figure 2: The overall framework of our method. First, in the Geometry Guided Diffusion stage for 3D scenes, we create a cubemap representing the scene and then outpaint the uncovered areas of the cubemap, as detailed in the subsection on 3D diffusion. Subsequently, we optimize the mesh texture and geometry. For the geometry optimization, we use monocular depth prediction as pseudo supervision and align the smooth areas of the scene, as elaborated in the section on mesh optimization.
  • Figure 3: Methods for generating scene texture. The step "diffuse" means generating a 2D image with diffusion models. The step "optimize" means updating the mesh texture with the generated 2D images (cf. the texture-only update equation). (a) A straightforward baseline based on outpainting with 2D diffusion models. Outpainting is achieved by masked diffusion; the gray area marks the masked region, which remains unchanged throughout the diffusion. (b) Generating a cubemap for the scene, then optimizing the mesh texture.
  • Figure 4: Illustration of the depth map and the distance map. The depth map measures the distance from the object plane to the screen plane, while the distance map measures the distance from each point to the camera origin.
  • Figure 5: Different controlling effects of the depth map and the distance map. The depth map changes rapidly at the boundary where two faces of the cubemap join, whereas the distance map changes smoothly. Generating consistent cubemaps with depth control is therefore challenging, while using the distance map yields more consistent texture. However, the distance map introduces artifacts, such as the window on the wall, because the diffusion model was conditioned on depth maps during training.
  • ...and 4 more figures
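
The distinction drawn in the captions of Figures 4 and 5 can be illustrated with a small sketch. It is not the authors' code: the helper names (`depth`, `distance`, `depth_to_distance`), the unit-focal pinhole camera per cube face, and the sample point are all assumptions made for illustration. The sketch shows that depth is tied to the viewing axis of a particular cube face, whereas distance depends only on the camera origin, which is one way to read why the distance map stays consistent across faces.

```python
import math

# Assumed optical axes for two adjacent cube faces sharing one camera origin.
FACE_AXES = {
    "front": (0.0, 0.0, 1.0),
    "right": (1.0, 0.0, 0.0),
}


def depth(point, axis):
    """Depth: length of the point's projection onto a face's optical axis."""
    return sum(p * a for p, a in zip(point, axis))


def distance(point):
    """Distance: Euclidean length from the camera origin to the point."""
    return math.sqrt(sum(p * p for p in point))


def depth_to_distance(z, u, v):
    """Convert depth z at normalized image coordinates (u, v) to distance.

    With focal length 1, the ray through (u, v) is (u, v, 1), so the 3D
    point is z * (u, v, 1) and its distance is z * sqrt(u^2 + v^2 + 1).
    """
    return z * math.sqrt(u * u + v * v + 1.0)


# A point on a flat wall at z = 2, close to the front/right face boundary.
point = (1.9, 0.0, 2.0)

front_depth = depth(point, FACE_AXES["front"])  # measured along +z
right_depth = depth(point, FACE_AXES["right"])  # measured along +x
dist = distance(point)                          # independent of any face

# Recover the same distance from the front-face depth value alone.
u, v = point[0] / point[2], point[1] / point[2]
dist_from_depth = depth_to_distance(front_depth, u, v)

print(f"front depth={front_depth:.3f}, right depth={right_depth:.3f}")
print(f"distance={dist:.3f}, from depth={dist_from_depth:.3f}")
```

The two depth values differ because each is measured along a different face axis, while `distance` gives one value for the point regardless of face. Converting a per-face depth map with `depth_to_distance` therefore yields a quantity that does not depend on which face a pixel belongs to.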