Inpaint4DNeRF: Promptable Spatio-Temporal NeRF Inpainting with Generative Diffusion Models
Han Jiang, Haosen Sun, Ruoxuan Li, Chi-Keung Tang, Yu-Wing Tai
TL;DR
Inpaint4DNeRF addresses text-guided generative inpainting in NeRF scenes, using diffusion models to fill in background regions left occluded after foreground edits. The method inpaints a small set of seed views with a diffusion model, derives coarse geometry proxies from them, and enforces cross-view and temporal consistency to refine a final 3D NeRF; the same approach extends to 4D dynamic scenes. Its three-stage pipeline (seed-view pre-processing, warmup training, and Iterative Dataset Update) combines diffusion-guided content with NeRF finetuning to produce plausible geometry and appearance while preserving the background. The approach shows promising qualitative results on both static and dynamic scenes, and ablations support the importance of seed pre-processing, depth supervision, and diffusion-based refinement for multiview and temporal coherence. The work enables promptable, background-consistent inpainting within NeRF and offers a foundation for further improvements in 4D consistency and geometry generation.
Abstract
Current Neural Radiance Fields (NeRF) can generate photorealistic novel views. For editing 3D scenes represented by NeRF, and with the advent of generative models, this paper proposes Inpaint4DNeRF, which capitalizes on state-of-the-art stable diffusion models (e.g., ControlNet) to directly generate the underlying completed background content, whether the scene is static or dynamic. The key advantages of this generative approach for NeRF inpainting are twofold. First, after rough mask propagation, to complete or fill in previously occluded content, we can individually generate a small subset of completed images with plausible content, called seed images, from which simple 3D geometry proxies can be derived. Second, the remaining problem is thus 3D multiview consistency among all completed images, which is now guided by the seed images and their 3D proxies. Without other bells and whistles, our generative Inpaint4DNeRF baseline framework is general and can be readily extended to 4D dynamic NeRFs, where temporal consistency can be handled naturally in much the same way as multiview consistency.
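
The abstract does not say how the rough mask propagation or the 3D proxies are implemented. As an illustration only, the sketch below shows one generic way a per-pixel depth map can carry a mask from one view into another: standard depth-based reprojection under a pinhole camera model. The function name `reproject_mask`, the shared intrinsics `K`, the camera-to-world pose convention (+z forward), and the nearest-pixel splatting are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def reproject_mask(mask, depth, K, c2w_src, c2w_dst):
    """Forward-warp a binary mask from a source view into a destination view.

    Hypothetical helper for illustration; not the paper's implementation.
    mask:    (H, W) bool, pixels to propagate in the source view
    depth:   (H, W) float, z-depth of each pixel in the source camera frame
    K:       (3, 3) pinhole intrinsics, assumed shared by both views
    c2w_src: (4, 4) source camera-to-world matrix (+z forward assumed)
    c2w_dst: (4, 4) destination camera-to-world matrix
    """
    H, W = mask.shape
    v, u = np.nonzero(mask)                      # rows, cols of masked pixels
    z = depth[v, u]

    # Back-project masked pixels into the source camera frame.
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)  # (3, N)
    cam_src = (np.linalg.inv(K) @ pix) * z       # (3, N), scaled by depth

    # Source camera -> world -> destination camera (homogeneous coordinates).
    cam_h = np.vstack([cam_src, np.ones((1, cam_src.shape[1]))])
    cam_dst = (np.linalg.inv(c2w_dst) @ c2w_src @ cam_h)[:3]

    # Keep points in front of the destination camera, then project.
    front = cam_dst[2] > 1e-6
    proj = K @ cam_dst[:, front]
    u2 = np.round(proj[0] / proj[2]).astype(int)
    v2 = np.round(proj[1] / proj[2]).astype(int)

    # Splat to the nearest pixel, discarding points outside the image.
    inside = (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    out = np.zeros((H, W), dtype=bool)
    out[v2[inside], u2[inside]] = True
    return out
```

Because this is a forward warp with nearest-pixel splatting, the propagated mask can contain holes and does not account for occlusion, so it is only a rough estimate. That is consistent with the abstract's description of the propagation as rough, with multiview consistency handled in later stages.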
