
ProteusNeRF: Fast Lightweight NeRF Editing using 3D-Aware Image Context

Binglun Wang, Niladri Shekhar Dutt, Niloy J. Mitra

TL;DR

ProteusNeRF targets interactive-rate editing of NeRF content without converting to meshes. It introduces TriplaneLite, a lightweight NeRF representation, and a 3D-aware image context that enforces view-consistent edits. The method distills semantic features for object selection and applies appearance edits through residual MLPs; larger edits jointly refine geometry and appearance by iteratively re-training the NeRF, guided by the 3D-aware context and diffusion-based image edits. Key contributions include a lightweight residual editing mechanism (roughly 4–36 KB per edit), a 3D-aware 2×2 image context for cross-view consistency, and substantial speedups (10–70 seconds per edit, a 10–30× speedup over comparable methods). The small memory footprint supports layered edits and rapid iteration in interactive workflows, though the approach still struggles with large geometric changes and specular effects.
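As a rough illustration of the 2×2 image context mentioned above, the sketch below tiles four rendered views into a single image so that one image-space edit sees all of them at once, then splits the edited result back into per-view images. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `tile_2x2` and `split_2x2`, the use of NumPy, and the assumption that all four views share the same resolution are illustrative choices. How the views are chosen and how the edit is produced are not specified in this summary.

```python
import numpy as np


def tile_2x2(views):
    """Tile four (H, W, 3) views into one (2H, 2W, 3) context image.

    Hypothetical helper: assumes exactly four views of identical shape.
    """
    if len(views) != 4:
        raise ValueError("expected exactly four views")
    top = np.concatenate([views[0], views[1]], axis=1)
    bottom = np.concatenate([views[2], views[3]], axis=1)
    return np.concatenate([top, bottom], axis=0)


def split_2x2(context):
    """Split a (2H, 2W, 3) context image back into its four (H, W, 3) views."""
    h, w = context.shape[0] // 2, context.shape[1] // 2
    return [
        context[:h, :w],  # view 0 (top-left)
        context[:h, w:],  # view 1 (top-right)
        context[h:, :w],  # view 2 (bottom-left)
        context[h:, w:],  # view 3 (bottom-right)
    ]


# Four dummy renders, each filled with a distinct constant value.
views = [np.full((4, 6, 3), float(i)) for i in range(4)]
context = tile_2x2(views)
recovered = split_2x2(context)
```

In this sketch an image editor would operate on `context` between the two calls, and the recovered views would then serve as supervision for fine-tuning the NeRF.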

Abstract

Neural Radiance Fields (NeRFs) have recently emerged as a popular option for photo-realistic object capture due to their ability to faithfully capture high-fidelity volumetric content even from handheld video input. Although much research has been devoted to efficient optimization leading to real-time training and rendering, options for interactively editing NeRFs remain limited. We present a very simple but effective neural network architecture that is fast and efficient while maintaining a low memory footprint. This architecture can be incrementally guided through user-friendly image-based edits. Our representation allows straightforward object selection via semantic feature distillation at the training stage. More importantly, we propose a local 3D-aware image context to facilitate view-consistent image editing that can then be distilled into fine-tuned NeRFs via geometric and appearance adjustments. We evaluate our setup on a variety of examples to demonstrate appearance and geometric edits and report a 10-30x speedup over concurrent work focusing on text-guided NeRF editing. Video results can be seen on our project webpage at https://proteusnerf.github.io.

Paper Structure (19 sections, 5 equations, 13 figures, 3 tables)


Figures (13)

  • Figure 1: Visual comparison of color editing. CLIP-NeRF [wang2022clip] bleeds color into the global scene, and DFF [kobayashi2022decomposing] shows undesirable color changes in the pistil and an unnatural color gradient. Our approach matches the impressive results of RecolorNeRF [gong2023recolornerf] while offering a more intuitive and flexible framework and taking a fraction of the editing time (10 seconds vs. 2-3 minutes for RecolorNeRF).
  • Figure 2: We present ProteusNeRF, which takes in a set of posed images and encodes them as a feature-distilled NeRF in a TriplaneLite representation. The user can easily select a part (yellow legos), which is converted to a 3D mask $\mathcal{M}_\text{sel}$. We generate a novel 3D-aware image context that allows editing via imaging tools while still producing view-coherent edits. This edited context is then converted back to a view-consistent NeRF by fine-tuning the TriplaneLite. The context image is updated and the process is iterated (2-3 times in our examples). Editing, primarily appearance editing, runs at interactive framerates.
  • Figure 3: Once the input posed images $\{I_i, C_i\}_{i=1:n}$ are feature-distilled into TriplaneLite, the user can select a region in any of the images (shown in orange here), which is then used to extract a 3D mask $\mathcal{M}_\text{sel}$. Suppressing the corresponding signal in the mask reveals the background across views.
  • Figure 4: We encode an object as a NeRF using our TriplaneLite structure that takes as input a point $\mathbf{p}:=(x,y,z)$ and encodes it as features $h(\mathbf{p})$ by projection and interpolation of features from three planar grids ($P_{xy}, P_{yz}, P_{xz}$). We then enable learning via four different MLPs, $\phi_{geom}$, $\phi_{sem}$, $\phi_{color}$, and $\phi_{edit}$, to factorize density, semantic features, color, and residual appearance, respectively. Training is supervised via photometric loss and distillation of image-space semantic features (DINO features). This enables semantic selection (see Figure 3). Furthermore, the structuring lets us interactively receive appearance updates while requiring a low memory overhead (36 KB/edit).
  • Figure 5: Although our 3D-aware image context helps to synchronize edits across nearby views, inconsistencies can still occur. We iterate between context-guided image edits, distillation into a refined NeRF, and regenerating new guidance images. Typically, we found that 2-3 iterations were enough to strike a balance between expressive edits and interactive performance.
  • ...and 8 more figures
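
The feature lookup described in the Figure 4 caption (project a point onto three planar grids, interpolate, and pass the result to the MLP heads) can be sketched as below. This is a minimal NumPy sketch under stated assumptions rather than the authors' code: the names `bilinear_sample` and `triplane_features` are hypothetical, points are assumed to be normalized to $[0,1]^3$, each plane is assumed to be a square `(R, R, C)` array, and the three per-plane features are combined by concatenation, which the caption does not specify.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly interpolate an (R, R, C) feature grid at coords u, v in [0, 1].

    u indexes the first grid axis and v the second; both are (N,) arrays.
    """
    res = plane.shape[0]
    # Map normalized coordinates to continuous grid positions in [0, R - 1].
    x = np.clip(u, 0.0, 1.0) * (res - 1)
    y = np.clip(v, 0.0, 1.0) * (res - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, res - 1)
    y1 = np.minimum(y0 + 1, res - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Interpolate along the first axis at both neighboring rows of the second axis.
    f_y0 = (1.0 - wx) * plane[x0, y0] + wx * plane[x1, y0]
    f_y1 = (1.0 - wx) * plane[x0, y1] + wx * plane[x1, y1]
    return (1.0 - wy) * f_y0 + wy * f_y1


def triplane_features(points, p_xy, p_yz, p_xz):
    """Compute h(p) for (N, 3) points by projecting onto the three planes.

    Assumption: the per-plane features are concatenated into one vector.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.concatenate(
        [
            bilinear_sample(p_xy, x, y),
            bilinear_sample(p_yz, y, z),
            bilinear_sample(p_xz, x, z),
        ],
        axis=1,
    )


# Example: three random 16x16 planes with 4 channels each.
rng = np.random.default_rng(0)
planes = [rng.standard_normal((16, 16, 4)) for _ in range(3)]
points = rng.random((5, 3))
h = triplane_features(points, *planes)  # shape (5, 12)
```

In this sketch, `h` would be the shared input to the density, semantic, color, and residual-edit heads. Keeping the heads as small separate MLPs on top of shared plane features is what allows an edit to be stored as a compact residual network instead of a copy of the full representation.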