Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints
Chuan Fang, Yuan Dong, Kunming Luo, Xiaotao Hu, Rakesh Shrestha, Ping Tan
TL;DR
Ctrl-Room tackles the challenge of text-driven 3D indoor room generation by decoupling layout and appearance. It introduces a holistic scene code and a diffusion-based Layout Generation Stage, followed by a Layout-guided Appearance Stage that uses a fine-tuned ControlNet and a panoramic NeRF (PeRF) to produce coherent, textured 3D rooms. A mask-guided editing pipeline enables flexible, geometry-consistent modifications without expensive edit-specific training. Experiments on Structured3D show superior layout plausibility, view-consistency, and editability compared with prior methods, with extensive ablations and a user study supporting the claims. The work advances practical 3D scene synthesis by ensuring structure-aware generation and easy interactive editing, with future work toward multi-room scalability and broader data sources.
Abstract
Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. We thus achieve high-quality 3D room generation with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive edit-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing reasonable, view-consistent, and editable 3D rooms from natural language prompts.
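To illustrate why an explicit scene code makes object-level edits cheap, the following is a minimal Python sketch, not the authors' implementation. It assumes each object is stored as a class label plus a 3D box (center, size, yaw); the names SceneObject, move, resize, and edit_scene are hypothetical, and the real parameterization may differ. Only the layout side of an edit is shown. Regenerating the appearance of the edited region (the mask-guided part) is out of scope here.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class SceneObject:
    """One entry of a hypothetical scene code: a labeled 3D box."""
    label: str    # semantic class, e.g. "bed" (assumed attribute)
    center: Vec3  # box center in room coordinates, meters (assumed)
    size: Vec3    # box extents along its local axes, meters (assumed)
    yaw: float    # rotation about the vertical axis, radians (assumed)


def move(obj: SceneObject, offset: Vec3) -> SceneObject:
    """Return a copy of obj translated by offset."""
    new_center = tuple(c + o for c, o in zip(obj.center, offset))
    return replace(obj, center=new_center)


def resize(obj: SceneObject, scale: float) -> SceneObject:
    """Return a copy of obj with all extents multiplied by scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return replace(obj, size=tuple(s * scale for s in obj.size))


def edit_scene(scene: List[SceneObject], index: int,
               offset: Vec3 = (0.0, 0.0, 0.0),
               scale: float = 1.0) -> List[SceneObject]:
    """Apply a move and a resize to one object; all others are untouched."""
    edited = resize(move(scene[index], offset), scale)
    return [edited if i == index else o for i, o in enumerate(scene)]


# Example: shift the bed 0.5 m along x and enlarge it by 50%.
scene = [
    SceneObject("bed", (1.0, 2.0, 0.5), (2.0, 1.0, 0.5), 0.0),
    SceneObject("desk", (3.0, 0.5, 0.5), (1.0, 0.5, 1.0), 0.0),
]
edited_scene = edit_scene(scene, 0, offset=(0.5, 0.0, 0.0), scale=1.5)
```

Because the edit changes only one entry of the scene code, the remaining objects are left exactly as they were, which is the property that allows the rest of the room to be preserved while the edited region is updated.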
