Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields

Ori Gordon, Omri Avrahami, Dani Lischinski

TL;DR

Blended-NeRF addresses the challenge of local, text-guided edits in NeRF scenes by introducing an ROI-based 3D object generator initialized from an existing NeRF and trained inside a user-specified region under CLIP supervision. The edited content is then blended with the original radiance field along camera rays using a novel volumetric blending scheme and a distance-aware smoothing operator, yielding natural, view-consistent results. The approach leverages priors from Dream Fields, including depth regularization, pose sampling, and directional prompts, to achieve high fidelity and realism. Quantitative and qualitative evaluations show improvements over prior local-editing baselines, enabling applications such as object insertion, replacement, blending, and texture editing in real-world 3D scenes, with potential for broader 3D editing tasks.

Abstract

Editing a local region or a specific object in a 3D scene represented by a NeRF or consistently blending a new realistic object into the scene is challenging, mainly due to the implicit nature of the scene representation. We present Blended-NeRF, a robust and flexible framework for editing a specific region of interest in an existing NeRF scene, based on text prompts, along with a 3D ROI box. Our method leverages a pretrained language-image model to steer the synthesis towards a user-provided text prompt, along with a 3D MLP model initialized on an existing NeRF scene to generate the object and blend it into a specified region in the original scene. We allow local editing by localizing a 3D ROI box in the input scene, and blend the content synthesized inside the ROI with the existing scene using a novel volumetric blending technique. To obtain natural looking and view-consistent results, we leverage existing and new geometric priors and 3D augmentations for improving the visual fidelity of the final result. We test our framework both qualitatively and quantitatively on a variety of real 3D scenes and text prompts, demonstrating realistic multi-view consistent results with much flexibility and diversity compared to the baselines. Finally, we show the applicability of our framework for several 3D editing applications, including adding new objects to a scene, removing/replacing/altering existing objects, and texture conversion.

Paper Structure (21 sections, 16 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Overview. (a) Training: Given a NeRF scene $F_{\theta}^{O}$, our pipeline trains a NeRF generator model $F_{\theta}^{G}$, initialized with $F_{\theta}^{O}$ weights and guided by a similarity loss defined by a language-image model such as CLIP CLIP2022, to synthesize a new object inside a user-specified ROI. This is achieved by casting rays and sampling points for the rendering process NeRF2020 only inside the ROI box. Our method introduces augmentations and priors to get more natural results. (b) Blending process: After training, we render the edited scene by blending the sample points generated by the two models along each view ray.
  • Figure 2: Large object replacement. Here we perform object replacement on the Blender ship scene by positioning the ROI box to include the sea and the bottom of the ship, and training our model to steer the edit towards the given text prompts.
  • Figure 3: Distance Smoothing Operator. We demonstrate our suggested smoothing operator in \ref{distance_blending} on a range of $\alpha$ values. When $\alpha$ is zero, all the weight goes to the edited scene, and as we increase $\alpha$, more attention is given to closer points from the original scene.
  • Figure 4: Blending Modes. Guided by "plant with green leaves and white and blue flowers". When using \ref{sigma_in_eqn} (second column), we allow $F_{\theta}^{G}$ to change the density of the original scene, in this case removing parts of the wheel. When using \ref{sigma_out_eqn} (third column), we can only add density to the scene, so the plant wraps around the wheel without changing it.
  • Figure 5: Comparison to VolumeDisentanglement2022 for object replacement. We compare our editing capabilities to VolumeDisentanglement2022 on the fern scene from the LLFF dataset mildenhall2019llff. The left and right images in each row are VolumeDisentanglement2022 and ours, respectively. Our proposed method produces more realistic results that agree better with the text. For example, the edit for the text "aspen tree" indeed looks like the trunk of an aspen tree in our result.
  • ...and 9 more figures
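
The Figure 1 caption says that rays are cast and points are sampled only inside the ROI box. The sketch below shows one common way to do this: a slab-method ray/box intersection followed by evenly spaced samples inside the hit interval. It is an illustration under assumptions, not the authors' implementation. The function names `ray_box_interval` and `sample_in_roi`, the axis-aligned box, and the midpoint sampling scheme are all hypothetical choices made for this example.

```python
import numpy as np

def ray_box_interval(origin, direction, box_min, box_max, eps=1e-12):
    """Slab test for a ray against an axis-aligned box (assumed ROI shape).

    Returns (t_near, t_far) such that origin + t * direction lies inside
    the box for t in [t_near, t_far], or None if the ray misses the box.
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    # Replace near-zero components so rays parallel to a slab do not divide by zero.
    safe_dir = np.where(np.abs(direction) < eps, eps, direction)
    t0 = (np.asarray(box_min, dtype=float) - origin) / safe_dir
    t1 = (np.asarray(box_max, dtype=float) - origin) / safe_dir
    t_near = float(np.max(np.minimum(t0, t1)))  # latest entry over the three slabs
    t_far = float(np.min(np.maximum(t0, t1)))   # earliest exit over the three slabs
    t_near = max(t_near, 0.0)                   # ignore anything behind the camera
    if t_near >= t_far:
        return None
    return t_near, t_far

def sample_in_roi(origin, direction, box_min, box_max, n_samples=8):
    """Return (n_samples, 3) points on the ray that lie inside the ROI box.

    Uses bin midpoints (an assumption); returns an empty array on a miss.
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    interval = ray_box_interval(origin, direction, box_min, box_max)
    if interval is None:
        return np.empty((0, 3))
    t_near, t_far = interval
    edges = np.linspace(t_near, t_far, n_samples + 1)
    t_mid = 0.5 * (edges[:-1] + edges[1:])
    return origin + t_mid[:, None] * direction

# Example: a camera at z = -2 looking down +z at a unit-half-width ROI box.
points = sample_in_roi([0, 0, -2], [0, 0, 1], [-1, -1, -1], [1, 1, 1], n_samples=4)
```

In a pipeline like the one described, only these in-box points would be passed to the generator model, while rays that miss the box would fall back to the original scene. That division of work is inferred from the caption rather than taken from the paper's equations.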