DualNeRF: Text-Driven 3D Scene Editing via Dual-Field Representation

Yuxuan Xiong; Yue Shi; Yishun Dou; Bingbing Ni

TL;DR

DualNeRF addresses two problems in text-driven 3D scene editing, blurry backgrounds and convergence to local optima, by introducing a dual-field representation (static $f_S$ and dynamic $f_D$) that preserves original scene features while enabling edits. It integrates a simulated annealing strategy into the iterative dataset update pipeline and uses a CLIP-based consistency indicator to filter edits, improving reliability and background fidelity. Across multiple scenes and prompts, empirical results show that DualNeRF achieves CLIP-based alignment comparable to IN2N, better background restoration (higher SSIM), and stronger resistance to local optima. The approach combines additional guidance from the original scene, a global search capability, and quality-aware data updates, with practical potential for more reliable and user-friendly 3D content creation.

Abstract

Recently, denoising diffusion models have achieved promising results in 2D image generation and editing. Instruct-NeRF2NeRF (IN2N) brings the success of diffusion to 3D scene editing through an "Iterative dataset update" (IDU) strategy. Despite its impressive results, IN2N suffers from blurry backgrounds and from getting trapped in local optima. The first problem is caused by IN2N's lack of efficient guidance for background maintenance, while the second stems from the interaction between image editing and NeRF training during IDU. In this work, we introduce DualNeRF to deal with these problems. We propose a dual-field representation that preserves features of the original scene and uses them as additional guidance for background maintenance during IDU. Moreover, a simulated annealing strategy is embedded into IDU to help the model escape local optima. A CLIP-based consistency indicator further improves editing quality by filtering out low-quality edits. Extensive experiments demonstrate that our method outperforms previous methods both qualitatively and quantitatively.
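
The two update-time mechanisms named in the abstract, simulated annealing and quality filtering, can be illustrated with a minimal sketch. It uses the generic Metropolis acceptance rule from simulated annealing together with a plain threshold filter; it is not the paper's actual schedule or indicator. The names `accept_edit`, `update_dataset`, `temperature`, and `min_score` are hypothetical, and the scores are made-up placeholders standing in for a CLIP-based consistency value per view.

```python
import math
import random


def accept_edit(old_score, new_score, temperature, rng):
    """Generic Metropolis rule (an assumption, not the paper's exact rule):
    always keep an improvement; keep a worse edit with probability
    exp((new_score - old_score) / temperature)."""
    if new_score >= old_score:
        return True
    if temperature <= 0.0:
        return False  # zero temperature reduces to greedy acceptance
    return rng.random() < math.exp((new_score - old_score) / temperature)


def update_dataset(scores, candidate_scores, temperature, min_score, rng):
    """Return per-view scores after one hypothetical dataset-update pass."""
    updated = []
    for old, new in zip(scores, candidate_scores):
        if new < min_score:
            # Quality filter: discard candidates below the threshold.
            updated.append(old)
        elif accept_edit(old, new, temperature, rng):
            updated.append(new)
        else:
            updated.append(old)
    return updated


rng = random.Random(0)
current = [0.20, 0.25, 0.30]     # placeholder scores of the current views
candidates = [0.35, 0.05, 0.28]  # placeholder scores of freshly edited views

# High temperature: a slightly worse edit may still be accepted.
hot = update_dataset(current, candidates, temperature=1.0, min_score=0.10, rng=rng)
# Zero temperature: only improvements that pass the filter are accepted.
cold = update_dataset(current, candidates, temperature=0.0, min_score=0.10, rng=rng)
print(hot, cold)
```

In this sketch, lowering `temperature` over successive passes moves the update from exploratory (occasionally keeping a worse edit) to greedy, which is the general idea behind using annealing to avoid local optima.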
