RoomPlanner: Explicit Layout Planner for Easier LLM-Driven 3D Room Generation
Wenzhuo Sun, Mingjian Liang, Wenxuan Song, Xuelian Cheng, Zongyuan Ge
TL;DR
RoomPlanner generates 3D indoor scenes automatically from short text prompts by combining hierarchical LLM-based reasoning and grounding with explicit layout constraints. A layout-aware planning stage enforces collision and reachability constraints, and a differentiable optimization stage refines a 3D Gaussian scene representation under diffusion priors, using the AnyReach camera trajectory and Interval Timestep Flow Sampling (ITFS) to deliver high-quality, editable scenes in under 30 minutes. The key contributions are a fully automated, end-to-end pipeline, explicit spatial constraints for plausible layouts, and a single-pass optimization that yields physically coherent, configurable interiors with faster rendering and higher visual fidelity. In the reported experiments, the framework outperforms prior methods both qualitatively and quantitatively while supporting broad editability, which makes it practical for design, embodied AI, and virtual production workflows.
Abstract
In this paper, we propose RoomPlanner, the first fully automatic 3D room generation framework that creates realistic indoor scenes from only a short text prompt. Without any manual layout design or panoramic image guidance, our framework generates explicit layout criteria for rational spatial placement. We begin by introducing a hierarchical structure of language-driven agent planners that automatically parse short, ambiguous prompts into detailed scene descriptions. These descriptions include raw spatial and semantic attributes for each object and for the background, which are then used to initialize 3D point clouds. To position objects within bounded environments, we implement two arrangement constraints that iteratively optimize the spatial arrangement, ensuring a collision-free and accessible layout. In the final rendering stage, we propose a novel AnyReach Sampling strategy for camera trajectories, along with the Interval Timestep Flow Sampling (ITFS) strategy, to efficiently optimize the coarse 3D Gaussian scene representation. Together, these strategies reduce the total generation time to under 30 minutes. Extensive experiments demonstrate that our method produces geometrically rational 3D indoor scenes, surpassing prior approaches in both rendering speed and visual quality while preserving editability. The code will be available soon.
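To make the two arrangement constraints concrete, the following is a minimal sketch of what a collision check and an accessibility check could look like on a discretized floor plan. It is not the RoomPlanner implementation: the grid representation, the `Box` type, and the functions `collision_free` and `reachable_objects` are all hypothetical names introduced here for illustration, and the abstract does not specify how the constraints are actually formulated or optimized.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned object footprint on the floor, in grid cells."""
    name: str
    x: int  # leftmost column
    y: int  # topmost row
    w: int  # width in cells
    h: int  # height in cells

    def cells(self):
        """Return the set of grid cells covered by this footprint."""
        return {(cx, cy)
                for cx in range(self.x, self.x + self.w)
                for cy in range(self.y, self.y + self.h)}


def collision_free(boxes, room_w, room_h):
    """True if every box lies inside the room and no two footprints overlap."""
    occupied = set()
    for box in boxes:
        inside = (box.x >= 0 and box.y >= 0
                  and box.x + box.w <= room_w and box.y + box.h <= room_h)
        cells = box.cells()
        if not inside or occupied & cells:
            return False
        occupied |= cells
    return True


def reachable_objects(boxes, room_w, room_h, start):
    """Names of boxes that have a free 4-neighbour cell reachable from `start`.

    `start` is an assumed entry point (for example a door cell). Reachability
    is computed with a breadth-first search over unoccupied cells.
    """
    occupied = set().union(*(b.cells() for b in boxes)) if boxes else set()
    if start in occupied:
        return set()
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    seen = {start}
    queue = deque([start])
    while queue:
        cx, cy = queue.popleft()
        for dx, dy in steps:
            nxt = (cx + dx, cy + dy)
            in_room = 0 <= nxt[0] < room_w and 0 <= nxt[1] < room_h
            if in_room and nxt not in occupied and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    reached = set()
    for box in boxes:
        for cx, cy in box.cells():
            if any((cx + dx, cy + dy) in seen for dx, dy in steps):
                reached.add(box.name)
                break
    return reached
```

In a sketch like this, a layout would be accepted only when `collision_free` returns true and `reachable_objects` contains every object name; an iterative optimizer could use the violations as a signal for moving objects, though how RoomPlanner does so is not described in the abstract.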
