DATENeRF: Depth-Aware Text-based Editing of NeRFs

Sara Rojas, Julien Philip, Kai Zhang, Sai Bi, Fujun Luan, Bernard Ghanem, Kalyan Sunkavalli

TL;DR

DATENeRF tackles multiview-consistent text-based editing of NeRF scenes by leveraging depth information as a geometry-guided bridge for 2D diffusion edits. The method uses depth-conditioned ControlNet for coherent edits, followed by a projection-based inpainting scheme that initializes edits with reprojected pixels before full inpainting, and ends with NeRF optimization to fuse changes into the 3D volume. This combination yields higher fidelity, more photorealistic textures, and stronger geometric consistency than prior approaches, while also supporting edge-guided and object-insertion edits. The approach accelerates convergence and broadens editing control, albeit with limitations on large geometric changes and ethical considerations around realistic content manipulation.

Abstract

Recent advancements in diffusion models have shown remarkable proficiency in editing 2D images based on text prompts. However, extending these techniques to edit scenes in Neural Radiance Fields (NeRF) is complex, as editing individual 2D frames can result in inconsistencies across multiple views. Our crucial insight is that a NeRF scene's geometry can serve as a bridge to integrate these 2D edits. Utilizing this geometry, we employ a depth-conditioned ControlNet to enhance the coherence of each 2D image modification. Moreover, we introduce an inpainting approach that leverages the depth information of NeRF scenes to distribute 2D edits across different images, ensuring robustness against errors and resampling challenges. Our results reveal that this methodology achieves more consistent, lifelike, and detailed edits than existing leading methods for text-driven NeRF scene editing.
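
Using scene depth to carry an edit from one image to another rests on standard pinhole reprojection. The following is a minimal sketch of that idea, not the authors' implementation. The function name `reproject_pixel`, the shared intrinsics, the relative pose `(R, t)`, and the use of z-depth (rather than distance along the ray) are all assumptions made for illustration.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) with z-depth `depth` from a source view to a target view.

    R (3x3 nested list) and t (length-3 list) form a hypothetical
    source-to-target camera transform; both views are assumed to share
    the same pinhole intrinsics (fx, fy, cx, cy).
    Returns (u', v', z') in the target view, or None if the point lies
    behind the target camera.
    """
    # Unproject the pixel to a 3D point in the source camera frame.
    p = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Apply the rigid transform into the target camera frame.
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if q[2] <= 0:
        return None
    # Project back onto the target image plane.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy, q[2])


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Centre pixel at depth 2, target camera shifted so the point moves +0.5 in x.
print(reproject_pixel(50, 50, 2.0, 100.0, 100.0, 50.0, 50.0, identity, [0.5, 0.0, 0.0]))
# -> (75.0, 50.0, 2.0)
```

One plausible use of the returned `z'` is to compare it against the depth rendered for the target view, so that pixels hidden in the target view are not overwritten; whether the paper handles occlusion this way is not stated in this excerpt.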

Paper Structure (19 sections, 7 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: DATENeRF uses a reconstructed NeRF scene's depth to guide text-based image edits. Compared to the state-of-the-art Instruct-NeRF2NeRF haque2023instruct method (top row), our method (bottom row) produces results that are significantly more photorealistic and better preserve high-frequency details across a diverse range of text prompts.
  • Figure 2: Overview. Our inputs are a NeRF (with its posed input images), per-view editing masks, and an edit text prompt. We use the NeRF depth to condition the masked region inpainting. We reproject this edited result to a subsequent viewpoint and use a hybrid inpainting scheme that first inpaints only the disoccluded regions and then refines the entire masked region. This is done by changing the inpainting masks (indicated by the blue and orange blocks on the right side) during diffusion.
  • Figure 3: Projection Inpainting. We analyze our proposed scheme using various views of the input sequence (row A) for the text prompt "Vincent Van Gogh". Frames edited using blended diffusion (row B, BD), without any form of control, align with the prompt but lack both geometric and photometric consistency. Using a depth-aware inpainting model (row C, $N=0$) achieves geometric alignment but suffers from photometric inconsistency. Iteratively projecting edited images to the next view and only inpainting occluded regions (row E, $N=20$) produces results that diverge as we get farther from the reference view; we show the projected pixels on top and the inpainted result below. Our hybrid scheme (row D, $N=5$) balances these two options by starting with the projection result but further refining it to preserve visual quality. Note that the minor inconsistencies that remain are resolved efficiently by the NeRF optimization.
  • Figure 4: Results. We present the results of our method on a diverse set of scenes. For each scene, we show input views on the left and results obtained from different text prompts after that.
  • Figure 5: Comparisons. We compare Instruct-NeRF2NeRF haque2023instruct, with and without our masks (columns 2 and 3), ViCA-NeRF vica with our masks (column 4), and our approach both with and without projection inpainting (columns 5 and 6). Our full method allows for drastic and more consistent edits, e.g., the textures of the plaid shirt and clown costume, the rainbow on the teddy bear, and the checkerboard pattern on the table.
  • ...and 10 more figures
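
The mask switching described in the captions of Figures 2 and 3 can be sketched as a per-step schedule. This is an illustrative reading of the captions, not the authors' code: the function name `inpaint_mask_for_step`, the boolean nested-list mask representation, and the interpretation of $N$ as the number of initial diffusion steps that keep the reprojected pixels are assumptions. The total of 20 steps used below is inferred from the $N=20$ case being described as projection-only.

```python
def inpaint_mask_for_step(step, n_projection_steps, edit_mask, projected_mask):
    """Return the inpainting mask (True = regenerate this pixel) for one step.

    edit_mask:      True where the edit should be applied.
    projected_mask: True where a pixel reprojected from an earlier view exists.
    Both are nested lists of booleans with the same shape (hypothetical format).
    """
    if step < n_projection_steps:
        # Early steps: keep reprojected pixels, inpaint only disoccluded ones.
        return [[e and not p for e, p in zip(e_row, p_row)]
                for e_row, p_row in zip(edit_mask, projected_mask)]
    # Remaining steps: refine the entire masked region.
    return [row[:] for row in edit_mask]


edit = [[True, True, False]]
proj = [[True, False, False]]
total_steps = 20  # assumed step count
n = 5             # hybrid setting from the Figure 3 caption
masks = [inpaint_mask_for_step(s, n, edit, proj) for s in range(total_steps)]
print(masks[0])   # [[False, True, False]]
print(masks[5])   # [[True, True, False]]
```

Under this reading, `n = 0` regenerates the full masked region at every step (row C), and `n = total_steps` never touches the reprojected pixels (row E).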