EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

Kaizhi Zheng, Xiaotong Chen, Xuehai He, Jing Gu, Linjie Li, Zhengyuan Yang, Kevin Lin, Jianfeng Wang, Lijuan Wang, Xin Eric Wang

TL;DR

EditRoom addresses language-guided 3D room layout editing with a two-component system: an LLM-based command parameterizer that converts natural language instructions into atomic editing operations, and a graph-diffusion-based scene editor that predicts the target scene graph and layout conditioned on the source scene and the instructions. To train and evaluate the approach, the authors build EditRoom-DB, a synthetic dataset of about 83k editing pairs generated automatically from existing 3D scene synthesis datasets. Empirically, EditRoom outperforms baselines on single-operation edits across multiple room types and generalizes to complex, multi-operation prompts without additional training. The work targets end-to-end language-guided 3D scene editing, with composable layout changes for applications in VR/AR and gaming.

Abstract

Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming. However, recent approaches to language-guided 3D scene editing either require manual interventions or focus only on appearance modifications without supporting comprehensive scene layout changes. In response, we propose EditRoom, a unified framework capable of executing a variety of layout edits through natural language commands, without requiring manual intervention. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes using a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add, and remove. To address the lack of data for language-guided 3D scene editing, we have developed an automatic pipeline to augment existing 3D scene synthesis datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our experiments demonstrate that our approach consistently outperforms other baselines across all metrics, indicating higher accuracy and coherence in language-guided scene layout editing.
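The six edit types named above can be illustrated with a small data-structure sketch. This is not the paper's scene editor, which is a learned diffusion model; it is a deterministic stand-in that only shows how a command broken into atomic operations composes into a target scene. The names `SceneObject`, `EditOp`, `apply_edit`, and `apply_edits`, the object ids, and the parameter keys are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class SceneObject:
    category: str
    position: Vec3             # object centre (assumed unit: metres)
    size: Vec3                 # bounding-box extents
    yaw_degrees: float = 0.0   # rotation about the vertical axis


@dataclass(frozen=True)
class EditOp:
    kind: str      # rotate | translate | scale | replace | add | remove
    target: str    # hypothetical object id within the scene
    params: dict   # operation-specific parameters (hypothetical keys)


Scene = Dict[str, SceneObject]


def apply_edit(scene: Scene, op: EditOp) -> Scene:
    """Return a new scene with one atomic edit applied; the input is not mutated."""
    result = dict(scene)
    if op.kind == "add":
        result[op.target] = op.params["object"]
        return result
    if op.target not in result:
        raise KeyError(f"unknown object id: {op.target}")
    if op.kind == "remove":
        del result[op.target]
        return result

    obj = result[op.target]
    if op.kind == "translate":
        dx, dy, dz = op.params["offset"]
        x, y, z = obj.position
        obj = replace(obj, position=(x + dx, y + dy, z + dz))
    elif op.kind == "rotate":
        obj = replace(obj, yaw_degrees=(obj.yaw_degrees + op.params["degrees"]) % 360.0)
    elif op.kind == "scale":
        factor = op.params["factor"]
        obj = replace(obj, size=tuple(s * factor for s in obj.size))
    elif op.kind == "replace":
        # Swap the object category while keeping its pose and size.
        obj = replace(obj, category=op.params["category"])
    else:
        raise ValueError(f"unsupported edit type: {op.kind}")
    result[op.target] = obj
    return result


def apply_edits(scene: Scene, ops: List[EditOp]) -> Scene:
    """Compose several atomic edits in order, as a multi-operation command would."""
    for op in ops:
        scene = apply_edit(scene, op)
    return scene


source = {"chair_1": SceneObject("chair", (1.0, 0.0, 2.0), (0.5, 1.0, 0.5))}
command = [
    EditOp("translate", "chair_1", {"offset": (1.0, 0.0, 0.0)}),
    EditOp("rotate", "chair_1", {"degrees": 90.0}),
]
target = apply_edits(source, command)
```

Treating each edit as a pure function from scene to scene makes multi-operation commands a simple left-to-right composition, which mirrors the "composable" framing of the title without claiming anything about how the learned editor works internally.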

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Editing Pipeline with EditRoom. EditRoom is a unified language-guided 3D scene layout editing framework that can automatically execute all layout editing types from natural language commands. It comprises a command parameterizer for natural language comprehension and a scene editor for editing execution. Given a source scene and natural language commands, it generates a coherent and appropriate target scene.
  • Figure 2: Scene Editor Overview. The scene editor aims to provide accurate, coherent editing results for the given source scene and language commands. It consists of two graph-transformer-based conditional diffusion models: one generates the semantic target scene graph, and the other estimates accurate pose and size information for each object in the generated target scene graph. Both diffusion processes are conditioned on the source scene and the breakdown command.
  • Figure 3: Qualitative results on single-operation commands. The left column shows the source scene with a single-operation command for each basic editing type. The examples show that EditRoom provides more coherent and appropriate edits across all editing types.
  • Figure 4: Qualitative results on multi-operation commands. The left column shows the source scene with multi-operation commands. The figure shows that EditRoom generalizes to complex natural language commands with multiple operations without further training on multi-operation data, while the baselines fail to execute coherent edits.
  • Figure 5: Visualization of the layout diffusion denoising process. The whole diffusion process is conditioned on both the source scene and the language commands. At the beginning of the process, the target scene layout starts from random noise. After the iterative denoising process, the target scene layout becomes coherent with the source scene and the command.
  • ...and 2 more figures
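The shape of the denoising loop described in the Figure 5 caption (start from random noise, refine iteratively under conditioning on the source scene and command) can be sketched as follows. This is a toy under stated assumptions: it uses a deterministic DDIM-style update with a linear noise schedule, neither of which is confirmed by this section as the paper's choice, and the "denoiser" is a hand-written stand-in that already knows the target layout rather than a trained graph transformer. All function names and the flat-vector layout encoding are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Layout = List[float]  # hypothetical flat encoding of object poses and sizes


def make_alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear schedule; index 0 is 1.0 (no noise)."""
    alpha_bars = [1.0]
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def ddim_sample(predict_clean: Callable[[Layout, int], Layout],
                dim: int, num_steps: int = 1000, seed: int = 0) -> Layout:
    """Deterministic DDIM-style sampling: begin at Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # pure-noise starting layout
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        x0_hat = predict_clean(x, t)  # conditional estimate of the clean layout
        # Noise implied by the current sample and the clean-layout estimate.
        eps_hat = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
                   for xi, x0i in zip(x, x0_hat)]
        # Move to the previous (less noisy) step along the same noise direction.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0_hat, eps_hat)]
    return x


def make_standin_denoiser(source_layout: Sequence[float],
                          command_offset: Sequence[float]) -> Callable[[Layout, int], Layout]:
    """Stand-in for a trained model: 'conditions' on source and command by computing the target directly."""
    target = [s + o for s, o in zip(source_layout, command_offset)]

    def predict_clean(noisy: Layout, step: int) -> Layout:
        return list(target)

    return predict_clean


source_layout = [1.0, 2.0, 0.0]
command_offset = [0.5, 0.0, -1.0]  # e.g. a translate command expressed as an offset
denoiser = make_standin_denoiser(source_layout, command_offset)
edited_layout = ddim_sample(denoiser, dim=len(source_layout))
```

Because the stand-in denoiser is an oracle, the loop converges to the target by construction; with a learned model, the quality of `predict_clean` is what determines whether the final layout is coherent with the source scene and the command.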