DualNeRF: Text-Driven 3D Scene Editing via Dual-Field Representation

Yuxuan Xiong, Yue Shi, Yishun Dou, Bingbing Ni

TL;DR

DualNeRF tackles the challenges of blurry backgrounds and local optima in text-driven 3D scene editing by introducing a dual-field representation (static $f_S$ and dynamic $f_D$) that preserves original scene features while enabling edits. It integrates a simulated annealing strategy into the iterative dataset update pipeline and employs a CLIP-based consistency indicator to filter edits, improving reliability and background fidelity. Empirical results show DualNeRF achieving comparable CLIP-based alignment to IN2N, better background restoration (higher SSIM), and stronger resistance to local optima, across multiple scenes and prompts. The approach advances 3D scene editing by combining robust guidance, global search capability, and quality-aware data updates, with practical potential for more reliable and user-friendly 3D content creation.

Abstract

Denoising diffusion models have recently achieved promising results in 2D image generation and editing. Instruct-NeRF2NeRF (IN2N) brings this success to 3D scene editing through an "Iterative Dataset Update" (IDU) strategy. Despite its impressive results, IN2N suffers from blurry backgrounds and a tendency to get trapped in local optima. The first problem is caused by IN2N's lack of effective guidance for background preservation, while the second stems from the interaction between image editing and NeRF training during IDU. In this work, we introduce DualNeRF to address both problems. We propose a dual-field representation that preserves features of the original scene and uses them as additional guidance for background preservation during IDU. Moreover, a simulated annealing strategy is embedded into IDU to help the model escape local optima. A CLIP-based consistency indicator further improves editing quality by filtering out low-quality edits. Extensive experiments demonstrate that our method outperforms previous methods both qualitatively and quantitatively.
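
To make the simulated annealing idea concrete, the following is a minimal sketch of the generic Metropolis acceptance rule that simulated annealing is usually built on. It is not taken from the paper: the abstract does not state DualNeRF's acceptance criterion, energy function, or temperature schedule, so `accept_edit`, `geometric_temperature`, the notion of an "energy" (lower is better), and the decay constants are all hypothetical names and values chosen for illustration.

```python
import math
import random


def accept_edit(current_energy, candidate_energy, temperature, rng=random):
    """Generic Metropolis rule (assumed here, not the paper's stated criterion).

    Energies are hypothetical scalars where lower means a better edit.
    A better candidate is always accepted; a worse one is accepted with
    probability exp(-delta / temperature), which lets the search leave
    a local optimum while the temperature is still high.
    """
    delta = candidate_energy - current_energy
    if delta <= 0.0:
        return True
    if temperature <= 0.0:
        # At zero temperature the rule degenerates to greedy acceptance.
        return False
    return rng.random() < math.exp(-delta / temperature)


def geometric_temperature(step, t0=1.0, decay=0.95):
    """Hypothetical geometric cooling schedule: t0 * decay**step."""
    return t0 * decay ** step
```

Early in the update loop the temperature is high, so occasional worse candidates are accepted; as it decays, the rule becomes increasingly greedy. How DualNeRF defines the quantities involved is specified in the paper body, not in this summary.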

Paper Structure

This paper contains 25 sections, 9 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Limitations of IN2N [haque2023instruct]. IN2N exposes two main limitations: (1) blurry backgrounds and (2) proneness to local optima. The first row compares background quality among renderings of the original scene, IN2N, and ours; IN2N produces the blurriest background. The second row shows an example of IN2N's local optima issue, which manifests as incomplete edits to the original scene. In comparison, DualNeRF outputs satisfactory results.
  • Figure 2: Overview of DualNeRF. DualNeRF consists of two neural radiance fields with the same network architecture: a static field $f_S$ and a dynamic field $f_D$. The static field $f_S$ is trained in the field initialization stage and frozen in the editing stage. The dynamic field $f_D$ is enabled during the editing stage and trained to perform the edit. The two fields are fused at the hidden-feature level. A simulated annealing-based IDU strategy is used to perform editing. Furthermore, a CLIP-based consistency indicator is computed from the inputs and outputs; it softly filters out low-quality edits and thereby cleans up the updated dataset.
  • Figure 3: Edits with Their CLIP-based Consistency. The bottom-right image is the original image $I$, while the remaining images are three IP2P edits based on the prompt "Make it Autumn". $I'_1$ is inconsistent with both the original image $I$ and the prompt $y$, which leads to the lowest consistency score $\mathcal{S}$. $I'_2$ turns the original image into an autumn scene but fails to preserve the original image. $I'_3$ is the best edit, with high consistency to both $I$ and $y$, resulting in the highest $\mathcal{S}$. These examples demonstrate the ability of $\mathcal{S}$ to filter out low-quality edits.
  • Figure 4: Qualitative Results. Comparison between DualNeRF and Instruct-NeRF2NeRF [haque2023instruct] over different scenes with different prompts. The three columns show, respectively, the original scene, the editing results of IN2N, and the editing results of DualNeRF. We strongly recommend that readers zoom in for a clearer view.
  • Figure 5: Comparison with SOTA 2D Image Editing Methods. The four columns show, respectively, the original scene and the editing results from different views generated by ControlNet [zhang2023adding], IP2P [brooks2023instructpix2pix], and ours. The prompts used in the two cases are "Turn the bear into a panda" and "Turn him into a clown", respectively.
  • ...and 1 more figure
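
The consistency indicator described in the Figure 3 caption rewards an edit that agrees with both the original image $I$ and the prompt $y$. The sketch below illustrates that idea only; the exact definition of $\mathcal{S}$ is not given in this summary. It assumes the embeddings have already been produced by a CLIP image/text encoder (no CLIP library is called here), and the averaging in `consistency_score`, the softmax in `soft_weights`, and the `sharpness` parameter are hypothetical choices, not the paper's formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_score(edit_emb, orig_emb, text_emb):
    """Hypothetical indicator: mean similarity of the edit to the original
    image embedding and to the prompt embedding. The paper's actual
    definition of S may differ."""
    return 0.5 * (cosine(edit_emb, orig_emb) + cosine(edit_emb, text_emb))


def soft_weights(scores, sharpness=10.0):
    """Turn scores into soft weights with a softmax instead of a hard cutoff,
    so low-scoring edits are down-weighted rather than discarded outright.
    `sharpness` is an illustrative parameter."""
    top = max(scores)
    exps = [math.exp(sharpness * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With toy two-dimensional embeddings, an edit aligned with both the original image and the prompt receives a higher score, and therefore a larger weight, than an edit aligned with only one of them or with neither.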