Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints

Chuan Fang, Yuan Dong, Kunming Luo, Xiaotao Hu, Rakesh Shrestha, Ping Tan

TL;DR

Ctrl-Room tackles the challenge of text-driven 3D indoor room generation by decoupling layout and appearance. It introduces a holistic scene code and a diffusion-based Layout Generation Stage, followed by a Layout-guided Appearance Stage that uses a fine-tuned ControlNet and a panoramic NeRF (PeRF) to produce coherent, textured 3D rooms. A mask-guided editing pipeline enables flexible, geometry-consistent modifications without retraining. Experiments on Structured3D show superior layout plausibility, view-consistency, and editability compared with prior methods, with extensive ablations and a user study supporting the claims. The work advances practical 3D scene synthesis by ensuring structure-aware generation and easy interactive editing, with future work toward multi-room scalability and broader data sources.

Abstract

Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. We thus achieve high-quality 3D room generation with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive edit-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.

Paper Structure

This paper contains 33 sections, 9 equations, 23 figures, and 3 tables.

Figures (23)

  • Figure 1: We present Ctrl-Room to achieve fine-grained textured 3D indoor room generation and editing. (a) Compared with Text2Room (Höllein et al., 2023) and MVDiffusion (Tang et al., 2023), Ctrl-Room can generate rooms with more plausible 3D structures. (b) Ctrl-Room supports flexible editing. Users can replace furniture items or change their positions easily.
  • Figure 2: Framework overview. In the Layout Generation Stage, we synthesize a scene code from the text input and convert it to a 3D bounding box representation to facilitate editing. In the Appearance Generation Stage, we project the bounding boxes into a semantic segmentation map to guide the panorama synthesis. The panorama is then reconstructed into a panoramic NeRF (PeRF) (Wang et al., 2023) model with layout guidance.
  • Figure 3: (a) A 3D scene $S$ is represented by its scene code $x_0 = \{o_i\}_{i=1}^{N}$, where each wall or furniture item $o_i$ is a row vector storing attributes such as class label $c_i$, location $l_i$, size $s_i$, and orientation $r_i$. (b) During the denoising process, we rotate both the input semantic layout panorama and the denoised image by $\gamma$ degrees at each step. Here we take $\gamma=90^\circ$ as an example.
  • Figure 4: The Layout-guided PeRF takes the input panorama, aligned depth map, and normal map as initialization. Then, a progressive inpainting module is introduced to generate consistent panoramic images at sampled novel views. The progressive inpainting module consists of the layout-guided panorama inpainting module and the layout-guided depth estimation module. The final RGB-D panoramic pairs are included as training views to fine-tune PeRF (Wang et al., 2023).
  • Figure 5: Qualitative comparison with previous works. For each method, we show a textured 3D mesh in the first row and two rendered images in the second row.
  • ...and 18 more figures
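
The scene code and the panorama rotation described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The attribute layout of each row (one-hot class, 3D location, 3D size, a single yaw angle), the class count `NUM_CLASSES`, and the helper names `make_object` and `rotate_panorama` are all assumptions made for illustration. The rotation is modeled as a circular column shift of an equirectangular image, which is the standard way to rotate such a panorama about the vertical axis; the shift direction is a convention chosen here.

```python
import numpy as np

NUM_CLASSES = 4  # assumed toy number of object classes (hypothetical)


def make_object(class_id, location, size, yaw):
    """Build one row o_i of the scene code.

    Assumed layout: [one-hot class c_i | location l_i (3) | size s_i (3) | yaw r_i (1)].
    """
    one_hot = np.zeros(NUM_CLASSES)
    one_hot[class_id] = 1.0
    return np.concatenate([one_hot, location, size, [yaw]])


def rotate_panorama(pano, gamma_deg):
    """Rotate an equirectangular panorama (H, W, C) about the vertical axis.

    A rotation by gamma degrees corresponds to a circular shift of
    gamma / 360 * W pixel columns.
    """
    width = pano.shape[1]
    shift = int(round(gamma_deg / 360.0 * width))
    return np.roll(pano, shift, axis=1)


# Scene code x_0: one row per wall or furniture item.
x0 = np.stack([
    make_object(0, [0.0, 0.0, 0.0], [4.0, 3.0, 0.1], 0.0),        # e.g. a wall
    make_object(2, [1.0, 0.0, 1.0], [2.0, 1.0, 1.0], np.pi / 2),  # e.g. a bed
])

# Editing by moving a furniture item only touches its location entries.
x0[1, NUM_CLASSES:NUM_CLASSES + 3] += np.array([0.5, 0.0, 0.0])

# Rotate a toy panorama by gamma = 90 degrees.
pano = np.arange(4 * 8 * 3).reshape(4, 8, 3)
rotated = rotate_panorama(pano, 90)
```

Because the shift is circular, applying the 90-degree rotation four times returns the original panorama, so no content is lost at the left and right image borders.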