InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes

Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari

TL;DR

InseRF addresses the challenge of inserting new semantic objects into NeRF-reconstructed scenes in a 3D-consistent manner, guided by a text description and a single 2D bounding box. It grounds the 3D insertion in a 2D reference view edited with a diffusion-based inpainting model, lifts that edit to a 3D object via single-view reconstruction, places the object using monocular depth with scale and distance optimization, and fuses the object and scene NeRFs with a scale-aware density blend, optionally refining with view-aware NeRF optimization. The method demonstrates improved 3D consistency and localized insertions compared with strong baselines like Instruct-NeRF2NeRF and MV-Inpainting, while requiring minimal explicit 3D input. This has practical impact for interactive 3D scene editing, enabling natural language-driven augmentation of complex scenes without full 3D priors.

Abstract

We introduce InseRF, a novel method for generative object insertion in the NeRF reconstructions of 3D scenes. Based on a user-provided textual description and a 2D bounding box in a reference viewpoint, InseRF generates new objects in 3D scenes. Recently, methods for 3D scene editing have been profoundly transformed, owing to the use of strong priors of text-to-image diffusion models in 3D generative modeling. Existing methods are mostly effective in editing 3D scenes via style and appearance changes or removing existing objects. Generating new objects, however, remains a challenge for such methods, which we address in this study. Specifically, we propose grounding the 3D object insertion to a 2D object insertion in a reference view of the scene. The 2D edit is then lifted to 3D using a single-view object reconstruction method. The reconstructed object is then inserted into the scene, guided by the priors of monocular depth estimation methods. We evaluate our method on various 3D scenes and provide an in-depth analysis of the proposed components. Our experiments with generative insertion of objects in several 3D scenes indicate the effectiveness of our method compared to the existing methods. InseRF is capable of controllable and 3D-consistent object insertion without requiring explicit 3D information as input. Please visit our project page at https://mohamad-shahbazi.github.io/inserf.
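The reason a 2D bounding box alone does not fix the 3D placement can be illustrated with a pinhole camera: projected size depends only on the ratio of object size to distance, so a depth estimate is needed to choose one placement. The following is a minimal sketch under that assumption, not the authors' placement procedure. The function names, the focal length, and the numbers are hypothetical.

```python
import math

def projected_height_px(object_height, depth, focal_px):
    """Pinhole projection: image height (pixels) of an object at a given depth."""
    return focal_px * object_height / depth

def object_height_from_bbox(bbox_height_px, depth, focal_px):
    """Invert the projection: object height implied by a bounding box and a depth.

    `depth` stands in for a monocular depth estimate at the box location
    (an assumption made for illustration).
    """
    return bbox_height_px * depth / focal_px

# Two different (size, distance) pairs produce the same bounding-box height,
# so the box by itself cannot separate scale from distance.
near = projected_height_px(object_height=1.0, depth=3.0, focal_px=600.0)
far = projected_height_px(object_height=2.0, depth=6.0, focal_px=600.0)
assert math.isclose(near, far)

# Once a depth value is available, the scale follows from the box height.
height = object_height_from_bbox(bbox_height_px=200.0, depth=3.0, focal_px=600.0)
```

Because monocular depth estimates are typically only approximate, a further optimization of scale and distance, as the summary above mentions, is a natural follow-up step. This sketch covers only the geometric relation.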

Paper Structure (24 sections, 19 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Overview of the proposed method. Given a single reference view annotated with a 2D bounding box and a text prompt describing the object to be inserted, a 2D edit is generated portraying a view of the object. This 2D edit is then warped to a 3D model of the object and placed into the scene using the procedure described in the paper's placement section. After the 3D placement, the object and scene representations are fused as described in the paper's fusion section. Finally, an optional refinement can be performed to further improve the appearance.
  • Figure 2: Examples of using InseRF to insert an object into the neural representation of different indoor and outdoor scenes.
  • Figure 3: Qualitative comparison of object insertion with different methods. I-N2N modifies existing objects instead of inserting a new object, and the inpainting baseline fails to create geometry at the desired location. Our method, in contrast, can insert new 3D-consistent objects at the desired location.
  • Figure 4: Visualization of the effect of scale optimization on object insertion. The placement of objects is more realistic and faithful to the original edit when performing scale/distance optimization to improve the alignment.
  • Figure 5: Visualization of the effect of scaling the densities when fusing the object and scene representation. When the re-scaling of the object NeRF is not accounted for in the volumetric rendering, the object is not properly displayed in the synthesized views.
  • ...and 4 more figures
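
The density re-scaling that the Figure 5 caption refers to can be motivated with the standard volume-rendering opacity, 1 - exp(-sigma * delta). If an object NeRF is enlarged by a factor s, every ray segment through it becomes s times longer, so its density must be divided by s for the object to keep the same opacity. The sketch below shows this under stated assumptions. The fusion rule (summed densities with density-weighted colors at shared sample points) and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def alpha_from_density(sigma, delta):
    """Opacity of one ray segment of length delta with volume density sigma."""
    return 1.0 - np.exp(-sigma * delta)

def fuse_at_shared_samples(sigma_scene, rgb_scene, sigma_obj, rgb_obj, scale):
    """Hypothetical fusion of scene and object fields at shared sample points.

    The object density is divided by `scale` to account for the re-scaling
    of the object NeRF; colors are mixed in proportion to density.
    """
    sigma_scene = np.asarray(sigma_scene, dtype=float)
    sigma_obj_world = np.asarray(sigma_obj, dtype=float) / scale
    sigma = sigma_scene + sigma_obj_world
    # Share of the total density that comes from the object (0 in empty space).
    w_obj = sigma_obj_world / np.maximum(sigma, 1e-12)
    rgb = ((1.0 - w_obj)[:, None] * np.asarray(rgb_scene, dtype=float)
           + w_obj[:, None] * np.asarray(rgb_obj, dtype=float))
    return sigma, rgb

# Opacity is unchanged when the segment grows by s and the density shrinks by s.
s = 2.5
same_opacity = np.isclose(alpha_from_density(4.0, 0.1),
                          alpha_from_density(4.0 / s, 0.1 * s))

# First sample lies inside the object only, second inside the scene only.
sigma, rgb = fuse_at_shared_samples(
    sigma_scene=[0.0, 2.0], rgb_scene=[[1, 0, 0], [1, 0, 0]],
    sigma_obj=[3.0, 0.0], rgb_obj=[[0, 0, 1], [0, 0, 1]],
    scale=1.5,
)
```

Omitting the division by `scale` would make an enlarged object too opaque and a shrunken object too transparent, which is one plausible reading of the rendering failure the caption describes.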