ED-NeRF: Efficient Text-Guided Editing of 3D Scene with Latent Space NeRF

Jangho Park, Gihyun Kwon, Jong Chul Ye

TL;DR

ED-NeRF tackles efficient text-guided editing of 3D scenes by training NeRF in the latent space of a diffusion model and augmenting it with a refinement layer to capture cross-pixel interactions. It extends Delta Denoising Score (DDS) to 3D with masking and a reconstruction term to produce targeted edits that preserve structure and view coherence. Empirically, ED-NeRF achieves faster editing and higher-quality outputs than state-of-the-art 3D editing methods on real-world data. The approach enables scalable, practical 3D scene editing with diffusion priors.

Abstract

Recently, there has been a significant advancement in text-to-image diffusion models, leading to groundbreaking performance in 2D image generation. These advancements have been extended to 3D models, enabling the generation of novel 3D objects from textual descriptions. This has evolved into NeRF editing methods, which allow the manipulation of existing 3D objects through textual conditioning. However, existing NeRF editing techniques have faced limitations in their performance due to slow training speeds and the use of loss functions that do not adequately consider editing. To address this, we present a novel 3D NeRF editing approach dubbed ED-NeRF, which embeds real-world scenes into the latent space of the latent diffusion model (LDM) through a unique refinement layer. This approach enables us to obtain a NeRF backbone that is not only faster but also more amenable to editing than traditional image-space NeRF editing. Furthermore, we propose an improved loss function tailored for editing by migrating the delta denoising score (DDS) distillation loss, originally used in 2D image editing, to the three-dimensional domain. This novel loss function surpasses the well-known score distillation sampling (SDS) loss in terms of suitability for editing purposes. Our experimental results demonstrate that ED-NeRF achieves faster editing speed while producing improved output quality compared to state-of-the-art 3D editing models.

Paper Structure (22 sections, 13 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Qualitative results of our method. ED-NeRF successfully edited 3D scenes with given target text prompts while preserving the original object structure and background regions.
  • Figure 2: Overall pipeline of training and inference stage. (a) We optimize ED-NeRF in the latent space, supervised by source latent. Naively matching NeRF to a latent feature map during optimization can degrade view synthesis quality. (b) Inspired by the embedding process of Stable Diffusion, we integrated additional ResNet blocks and self-attention layers as a refinement layer. (c) All 3D scenes are decoded from the Decoder when ED-NeRF renders a novel view feature map.
  • Figure 3: Expanding DDS into 3D for ED-NeRF editing. Pretrained ED-NeRF renders the target latent feature map, and a scheduler of the denoising model perturbs it to the sampled time step. Concurrently, the scheduler adds noise to the source latent using the same time step. Each of them is fed into the denoising model, and the DDS is determined by subtracting two different SDS scores. In combination with a binary mask, masked DDS guides NeRF in the intended direction of the target prompt without causing unintended deformations.
  • Figure 4: Comparison with baseline models. ED-NeRF demonstrates outstanding performance in effectively altering specific objects compared to other models. Baseline methods often failed to maintain the region beyond the target objects and failed to guide the model towards the target text.
  • Figure 5: Ablation studies. (a) If we only use DDS loss, the model fails to maintain the attributes of untargeted regions and often fails to reflect text conditions. (b) If we do not use masked reconstruction regularization, the regions beyond the target objects are again excessively changed. (c) If we remove the mask from DDS, unwanted artifacts occur in untargeted regions. (d) When the proposed refinement layer is removed, the results become blurry because the backbone NeRF cannot fully embed real-world scenes. Our proposed setting can modify a specific region in a 3D scene and follow the target word without causing unwanted deformations.
  • ...and 9 more figures
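
The masked DDS step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the pretrained noise predictor, the prompt embeddings are plain scalars, and sharing the same noise sample across both branches follows the original 2D DDS formulation rather than anything stated in the caption (which only mentions the shared time step). All function and parameter names are illustrative.

```python
import numpy as np


def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps


def toy_denoiser(z_t, prompt_emb, t):
    """Hypothetical stand-in for the text-conditioned noise predictor.

    A real pipeline would call the latent diffusion model's U-Net here.
    """
    return 0.1 * z_t + prompt_emb


def masked_dds_grad(z_tgt, z_src, emb_tgt, emb_src, mask,
                    alpha_bar_t, t, rng, denoiser=toy_denoiser):
    """Masked DDS direction with respect to the rendered target latent.

    z_tgt : latent feature map rendered by the NeRF being edited
    z_src : source latent of the same view
    mask  : binary mask (1 inside the region to edit), broadcastable to z_tgt
    """
    # Assumption: one noise sample and one time step shared by both branches.
    eps = rng.standard_normal(z_tgt.shape)
    zt_tgt = add_noise(z_tgt, eps, alpha_bar_t)
    zt_src = add_noise(z_src, eps, alpha_bar_t)

    # Two noise predictions: target latent with target prompt,
    # source latent with source prompt.
    eps_tgt = denoiser(zt_tgt, emb_tgt, t)
    eps_src = denoiser(zt_src, emb_src, t)

    # DDS is the difference of the two predictions; the mask zeroes the
    # update outside the region that should change.
    return mask * (eps_tgt - eps_src)


rng = np.random.default_rng(0)
z_src = rng.standard_normal((4, 8, 8))   # 4-channel latent, 8x8 spatial
z_tgt = z_src.copy()                     # editing starts from the source scene
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                  # edit only the central patch

grad = masked_dds_grad(z_tgt, z_src, emb_tgt=1.0, emb_src=0.5,
                       mask=mask, alpha_bar_t=0.7, t=10, rng=rng)
```

In an actual editing loop this direction would be backpropagated through the rendered latent into the NeRF parameters, alongside the masked reconstruction term mentioned in the Figure 5 caption. When the two prompts and the two latents coincide, the difference vanishes, which is the property that makes DDS better suited to editing than plain SDS.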