Taming Latent Diffusion Model for Neural Radiance Field Inpainting

Chieh Hubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, Hung-Yu Tseng

TL;DR

This work tempers the diffusion model's stochasticity with per-scene customization and mitigates the textural shift with masked adversarial training, yielding state-of-the-art NeRF inpainting results on various real-world scenes.

Abstract

Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Although recent work shows preliminary success in editing a reconstructed NeRF with a diffusion prior, these methods still struggle to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of the content synthesized by the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models to real data often yields a textural shift that is incoherent with the image condition due to auto-encoding errors. Both problems are further reinforced by the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. In our analyses, we also find that the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes. Project page: https://hubert0527.github.io/MALD-NeRF

Paper Structure (21 sections, 3 equations, 9 figures, 2 tables)


Figures (9)

  • Figure 1: NeRF inpainting. Given a set of posed images associated with inpainting masks, the proposed framework estimates a NeRF that renders high-quality novel views, where the inpainting region is realistic and contains high-frequency details. Our algorithm works for both forward-facing scenes and 360$^{\circ}$ scenes, and supports both single-object and multi-object removal.
  • Figure 2: Inconsistency and texture shift issues. We present the 2D inpainting results from our latent diffusion model. Given the same input image and mask, the results 1) are not consistent with one another and 2) exhibit a texture shift between the original and inpainted pixels. These issues introduce noticeable artifacts in the NeRF inpainting results.
  • Figure 3: Method overview. The proposed method uses a latent diffusion model to obtain the inpainted training images from the NeRF-rendered images using partial DDIM. The inpainted images are used to update the NeRF training dataset following the iterative dataset update protocol. (reconstruction) We use a pixel-level regression loss between the NeRF-rendered and ground-truth pixels to reconstruct the regions observed in the input images. (inpainting) We design a masked patch-based adversarial training scheme, which includes an adversarial loss and a discriminator feature matching loss, to supervise the inpainting regions.
  • Figure 4: Drawbacks of LPIPS. In some cases, the LPIPS score fails to indicate the visual quality. For example, generating a realistic baseball cap actually lowers the score as there is no object in the inpainting area in the ground truth image.
  • Figure 5: Per-scene customization. Our per-scene customization effectively guides the latent diffusion model to synthesize consistent, in-context content across views.
  • ...and 4 more figures
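
The split of supervision described in the Figure 3 caption (pixel regression on observed regions, adversarial and feature matching terms on inpainting regions) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: pixels, discriminator logits, and discriminator features are flattened into plain lists of floats, the non-saturating form of the adversarial loss is an assumption, and every function name and the weights `w_adv` and `w_fm` are hypothetical.

```python
import math


def masked_reconstruction_loss(rendered, target, inpaint_mask):
    """Mean squared error over observed pixels only (mask value 0)."""
    total, count = 0.0, 0
    for r, t, m in zip(rendered, target, inpaint_mask):
        if m == 0:  # observed pixel: supervise with pixel regression
            total += (r - t) ** 2
            count += 1
    return total / count if count else 0.0


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def generator_adversarial_loss(fake_logits):
    """Non-saturating GAN loss on discriminator logits of rendered patches.

    Assumption: the caption only says "adversarial loss"; this specific
    form is chosen here for illustration.
    """
    return sum(softplus(-logit) for logit in fake_logits) / len(fake_logits)


def feature_matching_loss(fake_feats, real_feats):
    """Mean absolute difference between discriminator features."""
    diffs = [abs(f - r) for f, r in zip(fake_feats, real_feats)]
    return sum(diffs) / len(diffs)


def total_loss(rendered, target, inpaint_mask,
               fake_logits, fake_feats, real_feats,
               w_adv=1.0, w_fm=1.0):
    """Combine the reconstruction and inpainting terms.

    Assumption: fake_* come from NeRF-rendered patches in the inpainting
    region and real_feats from the diffusion-inpainted patches; the
    weights are hypothetical.
    """
    recon = masked_reconstruction_loss(rendered, target, inpaint_mask)
    adv = generator_adversarial_loss(fake_logits)
    fm = feature_matching_loss(fake_feats, real_feats)
    return recon + w_adv * adv + w_fm * fm
```

In this sketch the pixel regression term never touches masked pixels, so the inpainting region is supervised only through the discriminator. That mirrors the abstract's observation that pixel-distance losses are harmful for the inpainted content.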