NeRF-In: Free-Form NeRF Inpainting with RGB-D Priors

Hao-Kang Liu, I-Chao Shen, Bing-Yu Chen

TL;DR

This work tackles editing a pre-trained NeRF: removing unwanted objects or retouching regions without category-specific data or training. It introduces an RGB-D guided inpainting framework that transfers a user-drawn mask across multiple views, generates guiding color and depth content for the masked regions, and updates the NeRF parameters under color- and depth-guiding losses so that the inpainted scene is visually plausible and geometrically consistent. Key contributions include a flexible pipeline for generating guiding material (STCN-based mask transfer, MST inpainting, and depth completion) and a joint NeRF optimization objective that enforces cross-view consistency while preserving non-masked regions. On LLFF and custom datasets, the approach yields more accurate depth and more consistent views than the compared baselines, which suggests it is practical for 3D scene editing.

Abstract

Though Neural Radiance Fields (NeRF) demonstrate compelling novel view synthesis results, it is still unintuitive to edit a pre-trained NeRF because the neural network's parameters and the scene geometry/appearance are often not explicitly associated. In this paper, we introduce the first framework that enables users to remove unwanted objects or retouch undesired regions in a 3D scene represented by a pre-trained NeRF without any category-specific data or training. The user first draws a free-form mask over a view rendered from the pre-trained NeRF to specify a region containing unwanted objects. Our framework then transfers the user-provided mask to other rendered views and estimates guiding color and depth images within these transferred masked regions. Next, we formulate an optimization problem that jointly inpaints the image content in all masked regions across multiple views by updating the NeRF model's parameters. We demonstrate our framework on diverse scenes and show that it obtains visually plausible and structurally consistent results across multiple views in less time and with less manual user effort.
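
The joint objective described above can be illustrated with a small sketch. It is not the paper's implementation: the exact loss definitions are not given in this excerpt, so the squared-error form, the `depth_weight` parameter, and the function names `compose_guide`, `mse`, and `guiding_loss` are all assumptions made for illustration. Flat Python lists stand in for image and depth tensors.

```python
def compose_guide(original, inpainted, mask):
    """Build a guiding image: inpainted values inside the mask, original elsewhere.

    Keeping the original values outside the mask is what lets the loss
    preserve non-masked regions. (Hypothetical helper, not from the paper.)
    """
    return [new if m else old for old, new, m in zip(original, inpainted, mask)]


def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def guiding_loss(render_rgb, render_depth, guide_rgb, guide_depth, depth_weight=1.0):
    """Color-guiding term plus weighted depth-guiding term.

    The squared-error form and depth_weight are assumptions of this sketch.
    """
    l_color = mse(render_rgb, guide_rgb)
    l_depth = mse(render_depth, guide_depth)
    return l_color + depth_weight * l_depth


# Toy view with four grayscale pixels; only the first pixel is masked.
mask = [True, False, False, False]
guide_rgb = compose_guide([0.0] * 4, [1.0] * 4, mask)
guide_depth = compose_guide([0.0] * 4, [2.0] * 4, mask)

# The current render still shows the original content everywhere.
total = guiding_loss([0.0] * 4, [0.0] * 4, guide_rgb, guide_depth, depth_weight=0.5)
```

In an actual optimization, a loss of this kind would be summed over the sampled views and minimized with respect to the NeRF parameters; here it is only evaluated once on toy data.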

Paper Structure (22 sections, 10 equations, 10 figures)

Figures (10)

  • Figure 1: Given a pre-trained NeRF model, the user can (a) choose a view and (b) draw a mask to specify the unwanted object in the 3D scene. Our framework optimizes the NeRF model based on the user-provided mask and removes the unwanted object in the masked region. The optimized NeRF synthesizes inpainted results that resemble the ground truth across different views.
  • Figure 2: (a) Given a pre-trained NeRF $F_{\Theta}$, a user specifies the unwanted region on a user-chosen view with a user-drawn mask. Our framework samples initial images and initial depth images and generates both guiding images and guiding depth images. (b) Our framework updates $\Theta$ by optimizing both the color-guiding loss ($L_{\text{color}}$) and the depth-guiding loss ($L_{\text{depth}}$). (The two arrow styles in the figure denote, respectively, rendering a view from a NeRF model and updating $\Theta$ by optimizing the losses.)
  • Figure 3: Our sampling strategy follows the trajectory in (a). Note that the target scene faces the "+y" direction, as shown in (b). Each blue dot represents a view we can sample, while the red dot represents the sampled view used in the optimization framework. We also use the sampled images to construct a point cloud for easier interpretation. We conduct all experiments with this setting.
  • Figure 4: Qualitative comparison - LLFF dataset. For each scene, we show the user-chosen view image and the user-provided mask on the left. We then show the color image and depth image generated by different methods: our method (ours), baseline1 (b1), and baseline2 (b2). The depth map of b1 still retains the depth of the unwanted object, while the color image of b2 can contain noise or shadows (shown in horns). Compared to these two baselines, our method produces better color and correct geometry in the final results.
  • Figure 5: Qualitative comparison - custom dataset. For each custom scene, we show the ground truth rendered image and the results generated by our framework, baseline1 (b1), and baseline2 (b2). Our framework generates more accurate depth maps and synthesizes finer structures than baseline1. Compared to baseline2, our framework synthesizes more realistic and sharper results.
  • ...and 5 more figures