Inpaint4DNeRF: Promptable Spatio-Temporal NeRF Inpainting with Generative Diffusion Models

Han Jiang, Haosen Sun, Ruoxuan Li, Chi-Keung Tang, Yu-Wing Tai

TL;DR

Inpaint4DNeRF tackles text-guided generative inpainting within NeRF scenes by using diffusion models to fill occluded backgrounds after foreground edits. The method seeds a small set of views with diffusion-based inpainting, derives coarse geometry proxies, and enforces cross-view and temporal consistency to refine a final 3D NeRF, extendable to 4D dynamic scenes. Its three-stage pipeline (seed-view pre-processing, warmup training, and Iterative Dataset Update) combines diffusion-guided content with NeRF finetuning, achieving plausible geometry and appearance while maintaining background fidelity. The approach shows promising qualitative results in both static and dynamic scenes, with ablations validating the importance of seed pre-processing, depth supervision, and diffusion-based refinement for multiview and temporal coherence. This work advances practical editing of complex scenes by enabling promptable, background-consistent inpainting within NeRF, and it offers a foundation for further improvements in 4D consistency and geometry generation.

Abstract

Current Neural Radiance Fields (NeRF) can generate photorealistic novel views. For editing 3D scenes represented by NeRF, and with the advent of generative models, this paper proposes Inpaint4DNeRF to capitalize on state-of-the-art stable diffusion models (e.g., ControlNet) for direct generation of the underlying completed background content, regardless of whether the scene is static or dynamic. The key advantages of this generative approach for NeRF inpainting are twofold. First, after rough mask propagation, to complete or fill in previously occluded content, we can individually generate a small subset of completed images with plausible content, called seed images, from which simple 3D geometry proxies can be derived. Second, the remaining problem is thus 3D multiview consistency among all completed images, now guided by the seed images and their 3D proxies. Without other bells and whistles, our generative Inpaint4DNeRF baseline framework is general and can be readily extended to 4D dynamic NeRFs, where temporal consistency can be naturally handled in a similar way to our multiview consistency.

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures.

Figures (6)

  • Figure 1: Baseline Overview. Our generative NeRF inpainting is based on the inpainted image of one training view. The other seed images and training images are obtained by using stable diffusion to hallucinate the corrupted details of the unproject-projected raw image. These images are then used to finetune the NeRF, with warmup training to reach geometric and coarse appearance convergence, followed by iterative training image updates to reach fine convergence. For the 4D extension, we first obtain a temporally consistent inpainted seed video based on the first seed image. Then, for each frame, we infer inpainted images on other views by projection and correction, as in our 3D baseline.
  • Figure 2: Our qualitative results in 3D. Each column illustrates an inpainting example. We show final renderings from 2 views to demonstrate multiview consistency. We also show depth maps and RGB images from different training stages to show their roles.
  • Figure 3: 4D NeRF Inpainting example. Text prompt: "a golden sword, side view". The first column corresponds to the first frame which includes the first seed image, and the other columns correspond to 2 later frames. Inpaint4DNeRF can generate a moving object that is overall consistent.
  • Figure 4: Training results with view-independent inpainting. Left: RGB render. Right: noisy and incorrect depth map.
  • Figure 5: Training results with Instruct-NeRF2NeRF. Left: result from our baseline. Right: result from warmup training followed by Instruct-NeRF2NeRF.
  • ...and 1 more figure
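
The Figure 1 caption refers to an "unproject-projected" raw image: content from an inpainted seed view is lifted to 3D using depth and then projected into another view. The sketch below illustrates that geometric step for a single pixel. It is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a pinhole camera with intrinsics shared by both views, a known rigid transform from the seed camera frame to the target camera frame, and z-depth along the optical axis. All function and parameter names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), assumed shared by both views


def unproject(u: float, v: float, depth: float, intrinsics: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with z-depth `depth` to a 3D point in the seed camera frame."""
    fx, fy, cx, cy = intrinsics
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(point: Vec3, rotation: Sequence[Sequence[float]],
              translation: Sequence[float]) -> Vec3:
    """Apply the rigid transform p' = R p + t (seed frame -> target frame)."""
    x, y, z = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def project(point: Vec3, intrinsics: Intrinsics) -> Optional[Tuple[float, float]]:
    """Project a camera-frame point to pixel coordinates; None if it is not in front of the camera."""
    fx, fy, cx, cy = intrinsics
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(u: float, v: float, depth: float, intrinsics: Intrinsics,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Unproject a seed-view pixel, move it to the target frame, and project it there."""
    p_seed = unproject(u, v, depth, intrinsics)
    p_target = transform(p_seed, rotation, translation)
    return project(p_target, intrinsics)


if __name__ == "__main__":
    K = (100.0, 100.0, 50.0, 50.0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # The principal-point pixel at depth 2, seen from a target frame offset by t = (1, 0, 0).
    print(reproject_pixel(50.0, 50.0, 2.0, K, identity, [1.0, 0.0, 0.0]))
```

Applied to every masked pixel, a warp like this leaves holes and stretched regions wherever the target view sees surfaces the seed view did not. This is consistent with the caption's description of using stable diffusion to hallucinate the corrupted details of the reprojected image.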