NeRF Inpainting with Geometric Diffusion Prior and Balanced Score Distillation

Menglin Zhang, Xin Luo, Yunwei Lan, Chang Liu, Rui Li, Kaidong Zhang, Ganlin Yang, Dong Liu

TL;DR

GB-NeRF is a novel framework that enhances NeRF inpainting by making better use of 2D diffusion priors, providing superior appearance fidelity and geometric consistency compared to existing approaches.

Abstract

Recent advances in NeRF inpainting have leveraged pretrained diffusion models to enhance performance. However, these methods often yield suboptimal results because they use 2D diffusion priors ineffectively. The limitations manifest in two critical aspects: pretrained diffusion models capture geometric information inadequately, and existing Score Distillation Sampling (SDS) methods provide suboptimal guidance. To address these problems, we introduce GB-NeRF, a novel framework that enhances NeRF inpainting through improved utilization of 2D diffusion priors. Our approach incorporates two key innovations: a fine-tuning strategy that simultaneously learns appearance and geometric priors, and a specialized normal distillation loss that integrates these geometric priors into NeRF inpainting. We also propose Balanced Score Distillation (BSD), a technique that surpasses existing methods such as Score Distillation Sampling (SDS) and its improved version, Conditional Score Distillation (CSD). BSD improves inpainting quality in both appearance and geometry. Extensive experiments show that our method provides superior appearance fidelity and geometric consistency compared to existing approaches.
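
For readers unfamiliar with the baseline that the abstract contrasts BSD against, the standard SDS gradient with classifier-free guidance is commonly written as below. This is background only and is not taken from this page: the symbols ($\theta$ for the NeRF parameters, $x$ for a rendered image, $x_t$ for its noised version, $y$ for the text prompt, $s$ for the guidance scale, $w(t)$ for the timestep weighting) are notation assumed for illustration, and the exact form of BSD is not stated in this excerpt.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
\hat{\epsilon}_\phi(x_t; y, t)
  = \epsilon_\phi(x_t; \varnothing, t)
  + s\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\bigr)

% Splitting the residual into a prompt-dependent part and a part that
% involves the unconditional prediction and the sampled noise:
\hat{\epsilon}_\phi(x_t; y, t) - \epsilon
  = s\,\bigl(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\bigr)
  + \bigl(\epsilon_\phi(x_t; \varnothing, t) - \epsilon\bigr)
```

Read this way, the second bracket depends on the unconditional noise prediction and on the randomly sampled noise $\epsilon$, which is one plausible reading of the high-variability terms that score distillation variants try to reduce.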

Paper Structure

This paper contains 14 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of our GB-NeRF framework compared to MVIP-NeRF. Both approaches leverage appearance (A) and geometric (G) priors from diffusion models through score distillation. To enhance geometric accuracy, we introduce two key innovations: (1) a specialized fine-tuning strategy using RGB-normal image pairs; (2) Balanced Score Distillation (BSD), which eliminates high-variability terms present in existing methods like SDS [poole2022dreamfusion] and CSD [yu2023text], providing more stable supervision for occluded regions. Compared to MVIP-NeRF [MVIPNeRF], our method achieves superior consistency and accuracy in inpainted regions.
  • Figure 2: Method overview. (Left) Diffusion fine-tuning: Our approach utilizes the DIODE dataset [diode_dataset], which provides high-quality RGB images and corresponding normal maps. Captions are generated from the RGB images using BLIP [li2022blip] and shared with the corresponding normal maps to leverage Stable Diffusion's text understanding capabilities. Modality identifiers ('normal map' or 'RGB image') are prepended to these captions to distinguish between modalities. LoRA is integrated into both the U-Net and the text encoder to further enhance the model's learning capacity. (Right) NeRF inpainting: Given posed RGB images with corresponding masks and text descriptions, GB-NeRF reconstructs realistic textures and accurate geometry through dual supervision. In unmasked regions, a direct pixel-wise RGB reconstruction loss ($L_{unma}$) provides supervision, while in masked regions, our BSD loss guides both RGB image and normal map generation using the fine-tuned diffusion model.
  • Figure 3: Impact of tuning coefficient $\omega_3$ on NeRF inpainting. The incorporation of unconditional noise prediction introduces excessive randomness, resulting in degraded inpainting quality.
  • Figure 4: Visual comparison with three representative approaches on two scenes. The first scene uses the prompt 'A stair' and the second uses 'A fence'. Our method handles both scenarios, producing view-consistent results with superior geometric accuracy and realistic textures. In the stair scene, our result preserves the stair structure and depth, whereas (a) and (b) retain residual geometry from the original objects and (c) generates artifacts. In the fence scene, our result shows a cleaner and more structured fence pattern.
  • Figure 5: Comparison of different fine-tuning strategies for the diffusion model. (a) Original diffusion model without fine-tuning; (b) Fine-tuning with normal maps only, using modality identifier 'normal map' as prompt; (c) Fine-tuning with both RGB images and normal maps, using modality identifiers 'RGB image' and 'normal map' as prompts; (d) Our approach: fine-tuning with RGB-normal image pairs using BLIP-generated captions prepended with modality identifiers as prompts. Results show that while strategies (b) and (c) underperform compared to the original model, our method significantly enhances the model's capability in normal map reconstruction.
  • ...and 1 more figures
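
Two ingredients of the Figure 2 caption lend themselves to a short sketch: prepending a modality identifier to a caption shared by an RGB image and its normal map, and restricting the pixel-wise reconstruction loss to unmasked pixels. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the `", "` separator, the squared-error form of the loss, and the mask convention (1 marks a pixel to be inpainted) are all hypothetical choices made for this example.

```python
from typing import Sequence, Tuple

# Modality identifiers quoted in the Figure 2 caption.
MODALITY_IDENTIFIERS = ("RGB image", "normal map")


def build_prompt(caption: str, modality: str, separator: str = ", ") -> str:
    """Prepend a modality identifier to a caption.

    The separator is an assumption; the caption only says the identifier
    is prepended.
    """
    if modality not in MODALITY_IDENTIFIERS:
        raise ValueError(f"unknown modality: {modality!r}")
    return f"{modality}{separator}{caption.strip()}"


def paired_prompts(caption: str) -> Tuple[str, str]:
    """Return prompts for an RGB image and its normal map.

    Both prompts share one caption and differ only in the identifier.
    """
    return build_prompt(caption, "RGB image"), build_prompt(caption, "normal map")


def unmasked_reconstruction_loss(
    rendered: Sequence[float],
    target: Sequence[float],
    mask: Sequence[int],
) -> float:
    """Mean squared error over unmasked pixels only.

    Assumed convention: mask value 1 marks a pixel to be inpainted, so only
    pixels with mask value 0 contribute. Masked pixels would instead be
    supervised by a distillation loss, which is not sketched here.
    """
    if not (len(rendered) == len(target) == len(mask)):
        raise ValueError("rendered, target and mask must have equal length")
    errors = [(r - t) ** 2 for r, t, m in zip(rendered, target, mask) if m == 0]
    return sum(errors) / len(errors) if errors else 0.0
```

For instance, `paired_prompts("a fence")` yields one prompt per modality that share the same caption text, and a mask of all ones makes the reconstruction loss zero because no pixel is supervised directly.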