
Chat-Edit-3D: Interactive 3D Scene Editing via Text Prompts

Shuangkang Fang, Yufeng Wang, Yi-Hsuan Tsai, Yi Yang, Wenrui Ding, Shuchang Zhou, Ming-Hsuan Yang

TL;DR

CE3D introduces a dialogue-driven framework for interactive 3D scene editing that decouples 2D editing from 3D reconstruction via the Hash-Atlas representation, enabling flexible integration of a wide range of visual models. An LLM-based dialog system orchestrates multiple 2D/3D visual experts, parses arbitrary user prompts, and drives atlas-space edits that are mapped back to the 3D scene. The Hash-Atlas network provides robust, fast atlas construction and enables high-fidelity atlas-based editing, supported by targeted losses and a merge–split editing strategy. Experimental results demonstrate strong atlas quality, scalable multi-round dialogue editing, and superior editing capabilities compared with baselines, highlighting CE3D’s practicality for unconstrained, interactive 3D editing with real-world scenes.

Abstract

Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing schemes for 3D scene editing still have shortcomings that hinder interactive use. Such schemes typically adhere to fixed input patterns, limiting users' flexibility in text input. Moreover, their editing capabilities are constrained by a single or a few 2D visual models and require intricate pipeline design to integrate these models into 3D reconstruction processes. To address these issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, which is centered around a large language model that accepts arbitrary textual input from users, interprets their intentions, and autonomously invokes the corresponding visual expert models. Furthermore, we design a scheme that uses Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design achieves complete decoupling between the 2D editing and 3D reconstruction processes, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse visual editing effects, and that it possesses strong scene comprehension and multi-round dialog capabilities. The code is available at https://sk-fun.fun/CE3D.

Paper Structure (17 sections, 7 equations, 15 figures, 4 tables)


Figures (15)

  • Figure 1: Examples of chatting with CE3D. We propose CE3D, a novel paradigm for 3D scene editing that is compatible with a variety of existing visual models. By managing these visual experts through an LLM, we achieve challenging dialogue-based scene editing tasks that are difficult to accomplish with previous methods.
  • Figure 2: Differences between CE3D and other typical text-driven editing frameworks. Existing methods (a)-(c) require complex designs to integrate specific 2D and 3D models. In contrast, CE3D (d) decouples the 2D editing process from the 3D representation, enhancing its compatibility with various visual models.
  • Figure 3: Illustration of the CE3D framework. The basic process is as follows: (1) Given the user's text query, ChatGPT interprets the text and determines whether visual tools are required for this dialogue. (2) When visual tools are needed, ChatGPT will call the desired tools from the model zoo and provide them with corresponding parameters. (3) The Backend further queries the atlases and other files to be invoked. In addition, if the atlases do not exist, the Backend first acquires them using the Hash-Atlas network. (4) The Executor executes visual tools to edit the atlases and feeds back new status to ChatGPT for subsequent actions. The edited atlases are then mapped back to the 3D scene views through the Hash-Atlas network for later scene reconstruction. (5) Since one dialogue may require multiple model calls, ChatGPT repeats the above process until it determines that visual tools are no longer needed. Then the Frontend responds to the user with the editing results and ChatGPT's outputs.
  • Figure 4: Illustration of Hash-Atlas. Given the views of a 3D scene, $F_m$ first maps each pixel coordinate from the views to two UV spaces and predicts the transparency of the foreground atlas. Subsequently, $F_h$ predicts the RGB values at each coordinate in the UV space, thereby obtaining the foreground and background atlases. When the atlases are edited, they can be mapped back to the original scene views.
  • Figure 5: Qualitative comparison of atlases with LNA (Kasten et al., 2021). The atlases obtained from Hash-Atlas preserve more scene detail. Moreover, our method keeps the relative positions of objects within the foreground and background atlases largely invariant and ensures that scene objects in the atlases undergo minimal distortion and deformation. Providing subsequent editing models with more precise object locations and more plausible visual information leads to better editing results.
  • ...and 10 more figures
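
The control flow described in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the CE3D implementation: `Backend`, `run_dialogue_turn`, `toy_planner`, and `toy_tools` are hypothetical names, the planner is a plain function standing in for ChatGPT, atlases are plain dictionaries of strings, and the final step of mapping edited atlases back to the 3D scene views is omitted. The sketch only shows the loop: plan a tool call, fetch or build the atlases, run the tool, record the new status, and stop once the planner decides no further tools are needed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-ins; none of these names come from the CE3D code base.
Atlas = Dict[str, str]                      # e.g. {"foreground": ..., "background": ...}
Tool = Callable[[Atlas, str], Atlas]        # a visual expert that edits atlases
Planner = Callable[[str, List[str]], Optional[Tuple[str, str]]]  # stand-in for the LLM


@dataclass
class Backend:
    """Stores atlases per scene and builds them on first access."""
    build_atlases: Callable[[str], Atlas]   # stand-in for the Hash-Atlas network
    cache: Dict[str, Atlas] = field(default_factory=dict)

    def get_atlases(self, scene_id: str) -> Atlas:
        # Step (3): acquire the atlases first if they do not exist yet.
        if scene_id not in self.cache:
            self.cache[scene_id] = self.build_atlases(scene_id)
        return self.cache[scene_id]


def run_dialogue_turn(query: str, scene_id: str, planner: Planner,
                      tools: Dict[str, Tool], backend: Backend,
                      max_calls: int = 8) -> Tuple[Optional[Atlas], List[str]]:
    """Run one user turn, which may involve several tool calls."""
    history: List[str] = []
    for _ in range(max_calls):              # assumed safety cap on tool calls
        step = planner(query, history)      # Steps (1)-(2): pick a tool, or None
        if step is None:                    # Step (5): no more tools needed
            break
        name, arg = step
        atlases = backend.get_atlases(scene_id)
        backend.cache[scene_id] = tools[name](atlases, arg)  # Step (4): edit atlases
        history.append(f"{name}({arg})")    # status fed back to the planner
    return backend.cache.get(scene_id), history


# Toy usage: a fixed two-step plan replaces the LLM's decisions.
def toy_planner(query: str, history: List[str]) -> Optional[Tuple[str, str]]:
    plan = [("segment", "dog"), ("recolor", "red")]
    return plan[len(history)] if len(history) < len(plan) else None


toy_tools: Dict[str, Tool] = {
    "segment": lambda atlas, arg: {**atlas, "mask": arg},
    "recolor": lambda atlas, arg: {**atlas, "foreground": f"{atlas['foreground']}+{arg}"},
}

backend = Backend(build_atlases=lambda sid: {"foreground": f"{sid}_fg",
                                             "background": f"{sid}_bg"})
result, log = run_dialogue_turn("make the dog red", "scene0",
                                toy_planner, toy_tools, backend)
print(log)
print(result)
```

Because every tool in this sketch reads and writes only the atlas dictionary, adding a new expert amounts to registering another entry in `toy_tools`, which mirrors the decoupling of 2D editing from 3D reconstruction that the captions describe.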