NeRF-Insert: 3D Local Editing with Multimodal Control Signals

Benet Oriol Sabat, Alessandro Achille, Matthew Trager, Stefano Soatto

TL;DR

NeRF-Insert is a local 3D editing framework for Neural Radiance Fields that uses inpainting to alter only a user-specified region, with multimodal conditioning from text, images, masks, or CAD models. It distills 2D inpainting results into a 3D NeRF via an Iterative Dataset Update, augmented by a visual hull region representation and a spatially constrained loss that preserves unedited areas. The approach uses noise-annealed diffusion and mask projection to promote 3D consistency and high-quality edits, outperforming prior methods such as Instruct-NeRF2NeRF in locality and visual fidelity while trading off some inter-frame consistency. Together, these elements deliver flexible, precise 3D editing with practical uses in augmented reality, film, and synthetic data generation. The authors acknowledge limitations of SDS-based edits and suggest richer region control and segmentation-based inputs as future directions.
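As a rough illustration of what a spatially constrained loss could look like, the sketch below supervises pixels inside the projected edit mask with the inpainted image and pixels outside it with the original render. This is an assumption made for illustration, not the paper's stated formulation: the function name `spatially_constrained_loss`, the weight `lam`, and the per-region averaging are all hypothetical.

```python
import numpy as np

def spatially_constrained_loss(rendered, inpainted, original, mask, lam=1.0):
    """Hypothetical per-view loss (illustrative, not the paper's exact form).

    rendered, inpainted, original: (H, W, 3) float arrays.
    mask: (H, W) boolean array, True inside the edit region.
    lam: assumed weight on the term that preserves unedited pixels.
    """
    mask = mask.astype(bool)
    # Squared RGB error against the inpainted target and the original render.
    err_edit = ((rendered - inpainted) ** 2).sum(axis=-1)
    err_keep = ((rendered - original) ** 2).sum(axis=-1)
    # Average each term over its own region; guard against empty regions.
    n_in = max(int(mask.sum()), 1)
    n_out = max(int((~mask).sum()), 1)
    edit_term = err_edit[mask].sum() / n_in
    keep_term = err_keep[~mask].sum() / n_out
    return edit_term + lam * keep_term
```

With this form, a render that matches the inpainted image inside the mask and the original everywhere else has zero loss, so any change outside the edit region is penalized directly.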

Abstract

We propose NeRF-Insert, a NeRF editing framework that allows users to make high-quality local edits with a flexible level of control. Unlike previous work that relied on image-to-image models, we cast scene editing as an in-painting problem, which encourages the global structure of the scene to be preserved. Moreover, while most existing methods use only textual prompts to condition edits, our framework accepts a combination of inputs of different modalities as reference. More precisely, a user may provide a combination of textual and visual inputs including images, CAD models, and binary image masks for specifying a 3D region. We use generic image generation models to in-paint the scene from multiple viewpoints, and lift the local edits to a 3D-consistent NeRF edit. Compared to previous methods, our results show better visual quality and also maintain stronger consistency with the original NeRF.

Paper Structure

This paper contains 21 sections, 1 equation, 7 figures, 1 table.

Figures (7)

  • Figure 1: NeRF-Insert is a flexible framework for NeRF inpainting with different control modalities. A user can specify a 3D region with two or more manually-drawn image masks or by positioning a mesh/CAD model on the scene. Moreover, inpainting can be controlled with a textual prompt or with a reference image that influences the appearance of the inserted object or edited region.
  • Figure 2: NeRF-Insert accepts a variety of conditioning inputs, which can be seen as a spectrum of levels of control. For example, a user can specify: 1) a textual description of an object and a rough 3D region where it should be inserted (via image masks); 2) a textual description of the object with its shape and pose determined by a CAD model; 3) additionally influence the appearance of the object via a reference image. In contrast, text-based editing methods such as Instruct-NeRF2NeRF do not afford the same flexibility. Our framework is generic enough to potentially incorporate other kinds of inpainting control modalities, for example masks from a segmentation model.
  • Figure 3: Overview of NeRF-Insert. We use a small number of manually-drawn masks or a posed mesh to define the 3D region of space to edit. This 3D region is projected onto the training views to obtain inpainting masks for all of the training images. We render the NeRF from the training viewpoints and inpaint the renders using Stable Diffusion or Paint-by-Example. We then replace the previous images in the training pipeline of the NeRF with the inpainted images.
  • Figure 4: We lift three manually-drawn masks to a 3D representation that can be rendered from an arbitrary viewpoint. As we see in the leftmost and rightmost masks in the bottom row, our projection accounts for occlusions in the existing scene.
  • Figure 5: Inpainting results. On the left are the textual or visual prompts used for Stable Diffusion inpainting or Paint-by-Example, respectively. To specify the 3D inpainting region, b), c), e), h), i), n) rely on 3 manually drawn masks, f) relies on a geometrically accurate mesh of a vase, while k), l), o), q), r) rely on geometrically coarse meshes such as a cube, sphere or cylinder. Refer to the supplementary material for more details about the masks and meshes used for each example.
  • ...and 2 more figures
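
The mask lifting described in the captions for Figures 3 and 4 can be illustrated with a generic visual hull test: a 3D point belongs to the region only if it projects inside the drawn mask in every masked view. The sketch below is a minimal version under stated assumptions and not the authors' implementation. It assumes pinhole cameras given as 3x4 projection matrices, and the function name `in_visual_hull` is hypothetical. It also omits the occlusion handling against the existing scene that Figure 4 mentions.

```python
import numpy as np

def in_visual_hull(points, projections, masks):
    """Return a boolean array marking points inside the visual hull.

    points: (N, 3) world-space points.
    projections: list of (3, 4) pinhole projection matrices (assumed K @ [R | t]).
    masks: list of (H, W) boolean masks, one per projection matrix.
    """
    points_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    inside = np.ones(len(points), dtype=bool)
    for P, mask in zip(projections, masks):
        uvw = points_h @ P.T                      # homogeneous pixel coordinates
        depth = uvw[:, 2]
        in_front = depth > 1e-8                   # discard points behind the camera
        safe_depth = np.where(in_front, depth, 1.0)
        u = np.floor(uvw[:, 0] / safe_depth).astype(int)   # column index
        v = np.floor(uvw[:, 1] / safe_depth).astype(int)   # row index
        h, w = mask.shape
        in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(points), dtype=bool)
        hit[in_image] = mask[v[in_image], u[in_image]]
        inside &= hit                             # must fall inside every mask
    return inside
```

Evaluating this test on sample points along each camera ray is one way such a region could be rendered as a 2D mask from an arbitrary viewpoint. Using more masked views carves the region more tightly, which is consistent with the captions' use of two or more (typically three) hand-drawn masks.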