DreamEditor: Text-Driven 3D Scene Editing with Neural Fields

Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, Guanbin Li

TL;DR

DreamEditor tackles the challenge of editing neural fields by converting them to mesh-based representations, enabling precise, region-local edits driven by text prompts. It identifies editing regions through cross-attention maps from a fine-tuned diffusion model and applies score distillation sampling within those regions to jointly adjust geometry and texture, while preserving unedited areas. The approach delivers high-fidelity, locally edited 3D scenes that surpass prior methods in both qualitative realism and quantitative alignment with prompts. Limitations include lighting control and Janus artifacts, with future plans to extend editing to unbounded scenes and improve lighting realism.

Abstract

Neural fields have achieved impressive advancements in view synthesis and scene reconstruction. However, editing these neural fields remains challenging due to the implicit encoding of geometry and texture information. In this paper, we propose DreamEditor, a novel framework that enables users to perform controlled editing of neural fields using text prompts. By representing scenes as mesh-based neural fields, DreamEditor allows localized editing within specific regions. DreamEditor utilizes the text encoder of a pretrained text-to-image diffusion model to automatically identify the regions to be edited based on the semantics of the text prompts. Subsequently, DreamEditor optimizes the editing region and aligns its geometry and texture with the text prompts through score distillation sampling [29]. Extensive experiments have demonstrated that DreamEditor can accurately edit neural fields of real-world scenes according to the given text prompts while ensuring consistency in irrelevant areas. DreamEditor generates highly realistic textures and geometry, significantly surpassing previous works in both quantitative and qualitative evaluations.
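
For reference, the score distillation sampling gradient mentioned above is usually written in the following form, as introduced in the text-to-3D literature. The notation is an assumption and may differ from the paper's own equations: $\theta$ denotes the scene parameters, $x = g(\theta)$ a rendered view, $y$ the text prompt, $\hat{\epsilon}_\phi$ the diffusion model's noise prediction, $\alpha_t$ and $\sigma_t$ the noise schedule, and $w(t)$ a timestep weighting.

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right], \qquad x_t = \alpha_t x + \sigma_t \epsilon
$$

In a region-local setting such as the one described here, $\theta$ would cover only the parameters that belong to the editing region.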

Paper Structure

This paper contains 17 sections, 7 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: The overview of our method. Our method edits a 3D scene by optimizing an existing neural field to conform with a target text prompt. The editing process involves three steps: (1) The original neural field is distilled into a mesh-based one. (2) Based on the text prompts, our method automatically identifies the editing region of the mesh-based neural field. (3) Our method utilizes the SDS loss to optimize the color feature $f_c$, geometry feature $f_g$, and vertex positions $v$ of the editing region, thereby altering the texture and geometry of the respective region. Best viewed in color.
  • Figure 2: Visual results of our method compared with two baselines on three different scenes. The results show that DreamEditor can precisely locate the relevant region, edit it faithfully to the text, and prevent undesirable modifications, all of which the baseline methods struggle to achieve.
  • Figure 3: Ablation study of the locating step. Editing without the locating step deforms the doll, breaking the consistency of the object.
  • Figure 4: Ablation study of the optimizing approach. Simultaneously optimizing both geometry features and vertex positions (Ours) generates red roses with more detailed and realistic 3D shapes.
  • Figure 5: Visualization of the editing region, where the bold words indicate keywords and the red area on the mesh represents the editing region.
  • ...and 4 more figures
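
The region-local optimization described in Figure 1 can be illustrated with a minimal sketch. It assumes that restricting an edit amounts to applying a gradient step only to vertices flagged as inside the editing region; the paper may implement the restriction differently. The function name `masked_update`, the boolean mask, and the learning rate are hypothetical, and the gradients are placeholders standing in for an SDS gradient. The same masking idea would apply per vertex to $f_c$, $f_g$, and $v$.

```python
from typing import List, Sequence

Vec = List[float]


def masked_update(
    params: Sequence[Vec],
    grads: Sequence[Vec],
    in_region: Sequence[bool],
    lr: float,
) -> List[Vec]:
    """Apply one gradient-descent step only to entries inside the editing region.

    params    : per-vertex parameters (e.g. positions or features), one vector each
    grads     : per-vertex gradients (placeholder for an SDS gradient)
    in_region : True where the vertex belongs to the editing region (hypothetical mask)
    lr        : step size
    """
    updated: List[Vec] = []
    for p, g, editable in zip(params, grads, in_region):
        if editable:
            # Vertex is inside the editing region: take a descent step.
            updated.append([pi - lr * gi for pi, gi in zip(p, g)])
        else:
            # Vertex is outside the region: keep it unchanged.
            updated.append(list(p))
    return updated


# Usage: only the first vertex is editable, so only it moves.
vertices = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
gradients = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]
mask = [True, False]
new_vertices = masked_update(vertices, gradients, mask, lr=0.5)
```

Keeping parameters outside the mask fixed is what would preserve the unedited parts of the scene in this sketch.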