MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior

Honghua Chen, Chen Change Loy, Xingang Pan

TL;DR

MVIP-NeRF inpaints NeRF scenes with view-consistent appearance and geometry by using diffusion priors in a joint RGB and normal-map optimization framework. It introduces appearance and geometry diffusion priors within the Score Distillation Sampling (SDS) paradigm, a multi-view SDS score that stabilizes completion under large view changes, and a smoothed normal-based geometry representation. The method reports better appearance realism (measured by LPIPS) and geometry coherence than state-of-the-art NeRF inpainting methods on the Real-S and Real-L datasets, supported by ablations comparing normal versus depth guidance and single- versus multi-view distillation. The approach reduces dependency on explicit per-view inpainting, but it incurs higher computational cost and requires careful tuning of classifier-free guidance (CFG) and temporal scheduling.

Abstract

Despite the emergence of successful NeRF inpainting methods built upon explicit RGB and depth 2D inpainting supervisions, these methods are inherently constrained by the capabilities of their underlying 2D inpainters. This is due to two key reasons: (i) independently inpainting constituent images results in view-inconsistent imagery, and (ii) 2D inpainters struggle to ensure high-quality geometry completion and alignment with inpainted RGB images. To overcome these limitations, we propose a novel approach called MVIP-NeRF that harnesses the potential of diffusion priors for NeRF inpainting, addressing both appearance and geometry aspects. MVIP-NeRF performs joint inpainting across multiple views to reach a consistent solution, which is achieved via an iterative optimization process based on Score Distillation Sampling (SDS). Apart from recovering the rendered RGB images, we also extract normal maps as a geometric representation and define a normal SDS loss that motivates accurate geometry inpainting and alignment with the appearance. Additionally, we formulate a multi-view SDS score function to distill generative priors simultaneously from different view images, ensuring consistent visual completion when dealing with large view variations. Our experimental results show better appearance and geometry recovery than previous NeRF inpainting methods.
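The SDS-based optimization described in the abstract can be illustrated with a small toy, offered only as a sketch under stated assumptions and not as the authors' implementation. The "scene" is a flat list of pixel values with an identity renderer, and `toy_noise_predictor` is a hypothetical stand-in for a pretrained diffusion model whose prior is concentrated at a single value `prior_mean`. The weighting `w = 1 - alpha_bar`, the noise-level range, and all function names are illustrative choices. The sketch shows the general SDS pattern: add noise to the rendering, ask the prior for a noise estimate, and use the weighted residual as a gradient, here applied only inside the mask.

```python
import math
import random


def toy_noise_predictor(x_t, alpha_bar, prior_mean):
    """Hypothetical stand-in for a pretrained diffusion model.

    If the prior places all its mass at `prior_mean`, then
    x_t = sqrt(alpha_bar) * prior_mean + sqrt(1 - alpha_bar) * eps,
    so the noise estimate is (x_t - sqrt(alpha_bar) * prior_mean) / sqrt(1 - alpha_bar).
    """
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(v - a * prior_mean) / s for v in x_t]


def sds_gradient(pixels, mask, prior_mean, rng):
    """One SDS-style gradient w.r.t. the pixels, zeroed outside the mask."""
    alpha_bar = rng.uniform(0.2, 0.8)  # random noise level (assumed range)
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in pixels]
    x_t = [a * p + s * e for p, e in zip(pixels, eps)]  # forward diffusion
    eps_hat = toy_noise_predictor(x_t, alpha_bar, prior_mean)
    w = 1.0 - alpha_bar  # one possible weighting choice
    # Weighted residual (eps_hat - eps); the denoiser Jacobian is skipped,
    # and the identity renderer makes d(pixels)/d(params) the identity.
    return [w * (eh - e) * m for eh, e, m in zip(eps_hat, eps, mask)]


def inpaint(pixels, mask, prior_mean, steps=500, lr=0.1, seed=0):
    """Gradient descent on the masked pixels using the SDS-style gradient."""
    rng = random.Random(seed)
    pixels = list(pixels)
    for _ in range(steps):
        grad = sds_gradient(pixels, mask, prior_mean, rng)
        pixels = [p - lr * g for p, g in zip(pixels, grad)]
    return pixels


# Pixels 0 and 1 are observed (mask 0); pixels 2 and 3 are to be filled in.
result = inpaint([0.9, 0.1, 0.0, 0.0], mask=[0, 0, 1, 1], prior_mean=0.5)
```

In this toy, the masked pixels drift toward the value favored by the prior while the observed pixels are left untouched. A real setup would replace the pixel list with NeRF renderings, backpropagate through the renderer, and supervise the unmasked pixels with reconstruction losses.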

Paper Structure (22 sections, 11 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Comparison of our MVIP-NeRF with two state-of-the-art approaches, Remove-NeRF [weder2023removing] and SPIn-NeRF [mirzaei2023spin]. Existing methods heavily depend on explicit RGB and depth inpainting results. This type of inpainting prior frequently shows inconsistency, inaccuracy, and misalignment to a certain degree (sub-figure (b)). In contrast, our approach implicitly exploits the diffusion prior (sub-figure (c)), resulting in more faithful and consistent results, in terms of both appearance and geometry.
  • Figure 2: Method overview. Given posed RGB images with corresponding masks, depth maps (optional), and a text description, MVIP-NeRF can faithfully recover plausible textures and accurate surface detail. In the optimization process, for unmasked regions, we employ direct pixel-wise RGB and depth reconstruction losses. For masked regions, we introduce an RGB and normal map co-filling approach, utilizing SDS losses. This approach iteratively completes and aligns the appearance and geometry of NeRF scenes without the need for explicit supervision. Furthermore, we implement a multi-view scoring mechanism within the diffusion process to effectively handle significant variations in viewpoints. Finally, novel views can be rendered from the NeRF scene, where the object has been removed.
  • Figure 3: Effect of different normal map generation methods. In the first column, we present the input image with a mask (black region) and the depth map generated by NeRF, optimized with unmasked pixels. The second column displays the normal map derived from the density field gradient and the corresponding optimized depth map. The final column highlights the improved accuracy and reliability of geometry reconstruction achieved through the use of a smoothed normal field.
  • Figure 4: Effect of multi-view score distillation. The first row shows inpainting results without the multi-view score, while the second row shows the results with the multi-view score ($N=5$).
  • Figure 5: Visual comparison with two representative approaches [mirzaei2023spin, weder2023removing] on two scenes. The first scene is from the Real-S dataset with accurate masks, while the second is from the Real-L dataset with large, roughly-covered masks. In the first scene, the input text prompt is "A stone bench" and for the second scene, it is "A brick wall". Our method effectively handles both types of scenes, successfully generating view-consistent scenes with valid geometries (see the bench shape) and detailed textures (see the brick seam).
  • ...and 5 more figures
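
The Figure 3 caption mentions normal maps derived from the density field gradient and a smoothed normal field. The sketch below illustrates the first idea on a toy problem: the normal at a point is taken as the negated, normalized gradient of the density, estimated here with central finite differences. The `density` function (a soft sphere), the step size `h`, and the `smoothed_normal` helper are all hypothetical. In particular, averaging normals over nearby sample points is only one plausible way to smooth a normal field and is not claimed to be the paper's formulation.

```python
import math


def density(p):
    """Hypothetical density field: a soft sphere of radius 1 at the origin."""
    r = math.sqrt(sum(c * c for c in p))
    return 1.0 / (1.0 + math.exp(10.0 * (r - 1.0)))


def density_gradient(p, h=1e-4):
    """Central finite-difference estimate of the density gradient at p."""
    grad = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        grad.append((density(hi) - density(lo)) / (2.0 * h))
    return grad


def normalize(v):
    """Scale v to unit length; return the zero vector if v has no length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v] if n > 0.0 else [0.0, 0.0, 0.0]


def normal(p):
    """Normal as the negated, normalized density gradient."""
    return normalize([-g for g in density_gradient(p)])


def smoothed_normal(p, radius=0.05):
    """Illustrative smoothing: average normals at p and six axis neighbors."""
    offsets = [(0.0, 0.0, 0.0)]
    for i in range(3):
        for sign in (-1.0, 1.0):
            o = [0.0, 0.0, 0.0]
            o[i] = sign * radius
            offsets.append(tuple(o))
    acc = [0.0, 0.0, 0.0]
    for o in offsets:
        n = normal([pc + oc for pc, oc in zip(p, o)])
        acc = [a + c for a, c in zip(acc, n)]
    return normalize(acc)


n_raw = normal([1.0, 0.0, 0.0])
n_smooth = smoothed_normal([1.0, 0.0, 0.0])
```

For the soft sphere, density falls off with distance from the origin, so both estimates at the surface point `[1, 0, 0]` point outward along the x-axis. On a noisy learned density field the raw gradient can fluctuate strongly between nearby samples, which is the kind of instability a smoothed normal field is meant to reduce.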