ViCA-NeRF: View-Consistency-Aware 3D Editing of Neural Radiance Fields

Jiahua Dong, Yu-Xiong Wang

TL;DR

ViCA-NeRF tackles multi-view-consistent 3D editing of NeRFs from text instructions. It uses two regularization signals (depth-guided geometric correspondence and latent-code alignment in a 2D diffusion model) to propagate edits from a few edited key views to the full scene. The method runs in two stages: first, key views are edited and blended into a coherent dataset; second, the NeRF is trained on the updated data, aided by warm-up and post-refinement. Experiments show higher consistency and detail than Instruct-NeRF2NeRF while running about 3 times faster. The code is publicly available, and the method applies across a broad range of scenes and instructions.

Abstract

We introduce ViCA-NeRF, the first view-consistency-aware method for 3D editing with text instructions. In addition to the implicit neural radiance field (NeRF) modeling, our key insight is to exploit two sources of regularization that explicitly propagate the editing information across different views, thus ensuring multi-view consistency. For geometric regularization, we leverage the depth information derived from NeRF to establish image correspondences between different views. For learned regularization, we align the latent codes in the 2D diffusion model between edited and unedited images, enabling us to edit key views and propagate the update throughout the entire scene. Incorporating these two strategies, our ViCA-NeRF operates in two stages. In the initial stage, we blend edits from different views to create a preliminary 3D edit. This is followed by a second stage of NeRF training, dedicated to further refining the scene's appearance. Experimental results demonstrate that ViCA-NeRF provides more flexible, efficient (3 times faster) editing with higher levels of consistency and details, compared with the state of the art. Our code is publicly available.
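
The geometric regularization described in the abstract relies on NeRF-rendered depth to relate pixels across views. A minimal sketch of such a depth-based reprojection follows. It assumes a pinhole camera, intrinsics shared by both views, and z-depth along the camera's +z axis. The function name `reproject_pixels` and all parameter names are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def reproject_pixels(depth_src, K, src_to_world, world_to_tgt):
    """Map every source pixel into a target view using per-pixel depth.

    depth_src:    (H, W) z-depth of the source view (assumed convention).
    K:            (3, 3) pinhole intrinsics, assumed shared by both views.
    src_to_world: (4, 4) source camera-to-world transform.
    world_to_tgt: (4, 4) world-to-target-camera transform.
    Returns (H, W, 2) target pixel coordinates (u, v) and an (H, W) validity mask.
    """
    H, W = depth_src.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid, each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project: X_cam = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth_src.reshape(-1, 1)

    # Move points source camera -> world -> target camera (row-vector form).
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_tgt = (pts_h @ src_to_world.T @ world_to_tgt.T)[:, :3]

    # Project into the target image plane; guard against points behind the camera.
    z = pts_tgt[:, 2]
    in_front = z > 1e-6
    proj = pts_tgt @ K.T
    uv = proj[:, :2] / np.where(in_front, z, 1.0)[:, None]

    in_bounds = (
        (uv[:, 0] >= 0) & (uv[:, 0] <= W - 1) &
        (uv[:, 1] >= 0) & (uv[:, 1] <= H - 1)
    )
    return uv.reshape(H, W, 2), (in_front & in_bounds).reshape(H, W)
```

The returned mask marks only pixels that land inside the target image and in front of the camera. A real pipeline would also need an occlusion check, which this sketch omits.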

Paper Structure (30 sections, 7 equations, 17 figures, 2 tables)

This paper contains 30 sections, 7 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Our ViCA-NeRF is the first work that achieves multi-view consistent 3D editing with text instructions, applicable across a broad range of scenes and instructions. Moreover, ViCA-NeRF exhibits controllability, allowing for early control of final results by editing key views. Notably, ViCA-NeRF is also efficient, surpassing state-of-the-art Instruct-NeRF2NeRF by being 3 times faster.
  • Figure 2: Overview of our ViCA-NeRF. Our proposed method decouples NeRF editing into two stages. In the first stage, we sample several key views and edit them through Instruct-Pix2Pix. Then, we use the depth map and camera poses to project edited keyframes to other views and obtain a mixup dataset. These images are further refined through our blending model. In the second stage, the edited dataset is directly used to train the NeRF model. Optionally, we can refine the dataset according to the updated NeRF.
  • Figure 3: Illustration of mixup procedure and blending model. We first mix up the image with the edited key views. Then, we introduce a blending model to further refine it. The blending model utilizes two modified Instruct-Pix2Pix ('Inp2p') processes. In each process, we generate multiple results and take their average on the latent code to decode a single final result.
  • Figure 4: Qualitative comparison with Instruct-NeRF2NeRF. Our ViCA-NeRF provides more details compared with Instruct-NeRF2NeRF. In addition, our ViCA-NeRF can handle challenging prompts, such as "Turn him into a robot," whereas Instruct-NeRF2NeRF fails in such cases.
  • Figure 5: Comparison on NeRF-Art. We compare the editing results based on NeRF-Art's nerfart sequences and edits. Our ViCA-NeRF produces more detailed information and achieves more substantial changes to the content.
  • ...and 12 more figures
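
The captions for Figures 2 and 3 describe two simple operations: mixing reprojected edited key views into an unedited view, and averaging several latent codes before decoding. The sketch below illustrates both on plain arrays. It is not the authors' implementation: the names `mixup` and `average_latents` are hypothetical, and the rule that earlier key views win where masks overlap is an assumption made only for this example. The diffusion model and decoder are left out entirely.

```python
import numpy as np

def mixup(original, warped_edits, masks):
    """Paste reprojected edited pixels onto an unedited view.

    original:     (H, W, 3) unedited image.
    warped_edits: list of (H, W, 3) edited key views already warped into this view.
    masks:        list of (H, W) boolean arrays, True where the warp is valid.
    Earlier key views take priority where masks overlap (assumed tie-break).
    Returns the mixed image and an (H, W) mask of pixels that were overwritten.
    """
    out = original.copy()
    filled = np.zeros(original.shape[:2], dtype=bool)
    for img, mask in zip(warped_edits, masks):
        take = mask & ~filled      # only fill pixels not yet covered
        out[take] = img[take]
        filled |= take
    return out, filled

def average_latents(latents):
    """Average several same-shaped latent codes into a single latent."""
    return np.mean(np.stack(latents, axis=0), axis=0)
```

Pixels left uncovered by any key view keep their original, unedited values, which is why the captions describe a further blending step after mixup.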