EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
Kaizhi Zheng, Xiaotong Chen, Xuehai He, Jing Gu, Linjie Li, Zhengyuan Yang, Kevin Lin, Jianfeng Wang, Lijuan Wang, Xin Eric Wang
TL;DR
EditRoom addresses language-guided 3D room layout editing with a two-component system: an LLM-based command parameterizer that converts natural language instructions into atomic editing operations, and a graph-diffusion-based scene editor that predicts the target scene graph and layout conditioned on the source scene and those operations. To train and evaluate the approach, the authors introduce EditRoom-DB, a dataset of 83k editing pairs produced by an automatic pipeline that augments existing 3D scene synthesis datasets. Empirically, EditRoom outperforms baselines on single-operation edits across multiple room types and generalizes to complex, multi-operation prompts without additional training. The result is an end-to-end framework for composable, language-guided layout changes, with applications in VR/AR and gaming.
Abstract
Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming. However, recent approaches to language-guided 3D scene editing either require manual interventions or focus only on appearance modifications without supporting comprehensive scene layout changes. In response, we propose EditRoom, a unified framework capable of executing a variety of layout edits through natural language commands, without requiring manual intervention. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes using a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add, and remove. To address the lack of data for language-guided 3D scene editing, we have developed an automatic pipeline to augment existing 3D scene synthesis datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our experiments demonstrate that our approach consistently outperforms other baselines across all metrics, indicating higher accuracy and coherence in language-guided scene layout editing.
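To make the idea of atomic editing operations concrete, the following is a minimal sketch of how a composite instruction might be represented as a sequence of the six edit types named above and applied to a toy scene. It is not the authors' method: in EditRoom the target scene is generated by a learned diffusion model, whereas this sketch applies each operation deterministically. All names here (`SceneObject`, `EditCommand`, `apply_command`, the field names, and the example scene) are hypothetical and chosen only for illustration.

```python
import dataclasses
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class SceneObject:
    """A toy object record (hypothetical; not the paper's scene representation)."""
    category: str
    position: Vec3
    yaw_degrees: float = 0.0
    scale: float = 1.0


@dataclass(frozen=True)
class EditCommand:
    """One atomic edit. `op` is one of the six edit types from the abstract."""
    op: str                           # rotate | translate | scale | replace | add | remove
    target: str                       # id of the object the edit refers to
    category: str = ""                # used by add / replace
    position: Vec3 = (0.0, 0.0, 0.0)  # used by add
    offset: Vec3 = (0.0, 0.0, 0.0)    # used by translate
    angle_degrees: float = 0.0        # used by rotate
    factor: float = 1.0               # used by scale


Scene = Dict[str, SceneObject]


def apply_command(scene: Scene, cmd: EditCommand) -> Scene:
    """Return a new scene with one atomic edit applied; the input is not mutated."""
    new_scene = dict(scene)
    if cmd.op == "add":
        new_scene[cmd.target] = SceneObject(cmd.category, cmd.position)
        return new_scene
    if cmd.op == "remove":
        del new_scene[cmd.target]
        return new_scene

    old = scene[cmd.target]
    if cmd.op == "replace":
        updated = dataclasses.replace(old, category=cmd.category)
    elif cmd.op == "translate":
        moved = tuple(p + d for p, d in zip(old.position, cmd.offset))
        updated = dataclasses.replace(old, position=moved)
    elif cmd.op == "rotate":
        new_yaw = (old.yaw_degrees + cmd.angle_degrees) % 360.0
        updated = dataclasses.replace(old, yaw_degrees=new_yaw)
    elif cmd.op == "scale":
        updated = dataclasses.replace(old, scale=old.scale * cmd.factor)
    else:
        raise ValueError(f"unknown edit type: {cmd.op}")
    new_scene[cmd.target] = updated
    return new_scene


def apply_commands(scene: Scene, commands: List[EditCommand]) -> Scene:
    """Compose atomic edits by applying them in order."""
    for cmd in commands:
        scene = apply_command(scene, cmd)
    return scene


# A composite instruction such as "move the bed right, turn it a quarter turn,
# take out the nightstand, and add a lamp", written by hand as atomic edits.
scene: Scene = {
    "bed": SceneObject("bed", (1.0, 0.0, 2.0)),
    "nightstand": SceneObject("nightstand", (2.5, 0.0, 2.0)),
}
commands = [
    EditCommand(op="translate", target="bed", offset=(0.5, 0.0, 0.0)),
    EditCommand(op="rotate", target="bed", angle_degrees=90.0),
    EditCommand(op="remove", target="nightstand"),
    EditCommand(op="add", target="lamp", category="lamp", position=(0.0, 0.0, 0.0)),
]
edited = apply_commands(scene, commands)
```

In this sketch the decomposition into commands is written by hand; in EditRoom that step is delegated to an LLM, and each atomic command conditions a generative model rather than a fixed rule. Applying the commands one at a time is what makes multi-operation prompts composable.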
