NeRF-Insert: 3D Local Editing with Multimodal Control Signals
Benet Oriol Sabat, Alessandro Achille, Matthew Trager, Stefano Soatto
TL;DR
NeRF-Insert is a local 3D editing framework for Neural Radiance Fields (NeRFs) that casts editing as inpainting, so that only a user-specified region changes. Edits can be conditioned on multimodal inputs: text, images, masks, or CAD models. The method distills 2D inpainting results into the 3D NeRF through an Iterative Dataset Update, represents the edit region as a visual hull, and applies a spatially constrained loss to preserve unedited areas. Noise-annealed diffusion and mask projection are used to keep edits 3D-consistent and high quality. The approach outperforms prior methods such as Instruct-NeRF2NeRF in locality and visual fidelity, at some cost in inter-frame consistency. Intended applications include augmented reality, film, and synthetic data generation. The paper also acknowledges limitations of SDS-based edits and suggests richer region control and segmentation-based inputs as future directions.
Abstract
We propose NeRF-Insert, a NeRF editing framework that allows users to make high-quality local edits with a flexible level of control. Unlike previous work that relied on image-to-image models, we cast scene editing as an in-painting problem, which encourages the global structure of the scene to be preserved. Moreover, while most existing methods use only textual prompts to condition edits, our framework accepts a combination of inputs of different modalities as reference. More precisely, a user may provide a combination of textual and visual inputs including images, CAD models, and binary image masks for specifying a 3D region. We use generic image generation models to in-paint the scene from multiple viewpoints, and lift the local edits to a 3D-consistent NeRF edit. Compared to previous methods, our results show better visual quality and also maintain stronger consistency with the original NeRF.
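To make the idea of specifying a 3D region from binary image masks more concrete, here is a minimal sketch of a generic visual-hull membership test: a 3D point counts as inside the region only if it projects inside the mask in every view. This is not the authors' implementation. The names `project`, `in_visual_hull`, and `center_mask`, the 3x4 projection matrices, and the toy two-camera setup are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[float]]  # 3x4 camera projection matrix (assumed pinhole model)
Mask = List[List[int]]      # mask[row][col]; 1 marks the user-drawn edit region


def project(P: Matrix, point: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Project a 3D point with P; return pixel (u, v), or None if behind the camera."""
    hom = (point[0], point[1], point[2], 1.0)
    u_h, v_h, w = (sum(row[i] * hom[i] for i in range(4)) for row in P)
    if w <= 0:
        return None
    return u_h / w, v_h / w


def in_visual_hull(point: Sequence[float], views: List[Tuple[Matrix, Mask]]) -> bool:
    """True only if the point lands on a masked pixel in every view."""
    for P, mask in views:
        uv = project(P, point)
        if uv is None:
            return False
        col, row = math.floor(uv[0]), math.floor(uv[1])
        # Assumption: points projecting outside the image are treated as outside the hull.
        if not (0 <= row < len(mask) and 0 <= col < len(mask[0])):
            return False
        if mask[row][col] != 1:
            return False
    return True


def center_mask(size: int = 10, lo: int = 4, hi: int = 6) -> Mask:
    """Toy mask: a square of ones in the middle of a size x size image."""
    return [[1 if lo <= r <= hi and lo <= c <= hi else 0 for c in range(size)]
            for r in range(size)]


# Hypothetical setup: one camera at the origin looking along +z,
# and one camera at x = -3 looking along +x, both aimed at (0, 0, 2).
P_front = [[10.0, 0.0, 5.0, 0.0], [0.0, 10.0, 5.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P_side = [[5.0, 0.0, 10.0, -5.0], [5.0, 10.0, 0.0, 15.0], [1.0, 0.0, 0.0, 3.0]]
views = [(P_front, center_mask()), (P_side, center_mask())]
```

In this toy setup, the point `(0, 0, 2)` falls inside both masks and is accepted, while `(0, 0, 3)` falls inside the front mask but outside the side mask and is rejected. The second case shows why several views are needed to bound the region in depth.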
