NeRF-In: Free-Form NeRF Inpainting with RGB-D Priors
Hao-Kang Liu, I-Chao Shen, Bing-Yu Chen
TL;DR
This work tackles editing a pre-trained NeRF, removing unwanted objects or retouching regions, without category-specific data or training. It introduces an RGB-D guided inpainting framework that transfers a user-drawn mask across multiple views, generates guiding color and depth content, and optimizes the NeRF parameters with color- and depth-guiding losses to produce visually plausible, geometrically consistent inpainted scenes. Key contributions are a flexible guiding-material generation pipeline (STCN-based mask transfer, MST inpainting, and depth completion) and a joint NeRF optimization objective that enforces cross-view consistency while preserving non-masked regions. The approach shows improved depth accuracy and view consistency on LLFF and custom datasets, which suggests it is practical for 3D scene editing.
Abstract
Though Neural Radiance Fields (NeRF) demonstrate compelling novel view synthesis results, editing a pre-trained NeRF remains unintuitive because the neural network's parameters and the scene geometry/appearance are often not explicitly associated. In this paper, we introduce the first framework that enables users to remove unwanted objects or retouch undesired regions in a 3D scene represented by a pre-trained NeRF without any category-specific data or training. The user first draws a free-form mask over a rendered view from the pre-trained NeRF to specify a region containing unwanted objects. Our framework then transfers the user-provided mask to other rendered views and estimates guiding color and depth images within the transferred masked regions. Next, we formulate an optimization problem that jointly inpaints the image content in all masked regions across multiple views by updating the NeRF model's parameters. We demonstrate our framework on diverse scenes and show that it obtains visually plausible and structurally consistent results across multiple views in less time and with less manual user effort.
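
To make the joint objective more concrete, the following is a minimal sketch of how per-view color- and depth-guiding terms might be combined. It is an illustration under stated assumptions, not the authors' implementation: the names `masked_mse`, `guiding_loss`, `depth_weight`, and the dictionary keys are hypothetical, images are reduced to small grids of scalars, and the actual method may weight the terms differently or restrict them to different views and regions. The sketch only evaluates the objective; in practice it would be minimized by gradient descent on the NeRF parameters.

```python
def masked_mse(pred, target, mask, inside):
    """Mean squared error over pixels where (mask != 0) == inside; 0.0 if none."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if bool(m) == inside:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


def guiding_loss(views, depth_weight=1.0):
    """Sum hypothetical color- and depth-guiding terms over all views.

    Each view is a dict with these (assumed) keys:
      rendered_color, rendered_depth : renderings from the NeRF being optimized
      original_color                 : rendering from the pre-trained NeRF
      guide_color, guide_depth       : inpainted color and completed depth
      mask                           : 1 inside the transferred mask, 0 outside
    """
    loss = 0.0
    for v in views:
        # Inside the mask, pull the rendering toward the guiding color image.
        loss += masked_mse(v["rendered_color"], v["guide_color"], v["mask"], True)
        # Outside the mask, stay close to the pre-trained rendering.
        loss += masked_mse(v["rendered_color"], v["original_color"], v["mask"], False)
        # Inside the mask, pull the rendered depth toward the guiding depth.
        loss += depth_weight * masked_mse(
            v["rendered_depth"], v["guide_depth"], v["mask"], True
        )
    return loss


# Toy 2x2 view: the right column is masked.
view = {
    "mask": [[0, 1], [0, 1]],
    "rendered_color": [[0.2, 0.9], [0.4, 0.9]],
    "original_color": [[0.2, 0.9], [0.4, 0.8]],
    "guide_color": [[0.0, 0.5], [0.0, 0.5]],
    "rendered_depth": [[1.0, 2.0], [1.0, 2.0]],
    "guide_depth": [[1.0, 3.0], [1.0, 3.0]],
}
print(guiding_loss([view]))
```

In the toy view, the unmasked pixels already match the pre-trained rendering, so only the masked color and depth residuals contribute to the loss. Summing the same terms over every view is what couples the views in this sketch: a single set of NeRF parameters has to satisfy all guiding images at once.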
