ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models

Meng-Li Shih, Wei-Chiu Ma, Lorenzo Boyice, Aleksander Holynski, Forrester Cole, Brian L. Curless, Janne Kontkanen

TL;DR

ExtraNeRF addresses the challenge of extrapolating neural radiance fields from a small set of views by integrating a base NeRF with diffusion priors guided by a visibility map. The method trains a BaseNeRF on observed views, then iteratively inpaints unseen regions with a scene-tuned diffusion model and enhances details with a second diffusion model, all while supervising with virtual views and depth information. Per-scene diffusion fine-tuning and a dedicated visibility/depth completion pipeline yield sharp, coherent disoccluded content and achieve state-of-the-art results on LLFF and Tanks & Temples benchmarks with few input views. The approach offers a practical pathway to extend NeRFs beyond observed data, enabling richer, more flexible view exploration in real-world capture scenarios.

Abstract

We propose ExtraNeRF, a novel method for extrapolating the range of views handled by a Neural Radiance Field (NeRF). Our main idea is to leverage NeRFs to model scene-specific, fine-grained details, while capitalizing on diffusion models to extrapolate beyond our observed data. A key ingredient is to track visibility to determine what portions of the scene have not been observed, and focus on reconstructing those regions consistently with diffusion models. Our primary contributions include a visibility-aware diffusion-based inpainting module that is fine-tuned on the input imagery, yielding an initial NeRF with moderate quality (often blurry) inpainted regions, followed by a second diffusion model trained on the input imagery to consistently enhance, notably sharpen, the inpainted imagery from the first pass. We demonstrate high-quality results, extrapolating beyond a small number of (typically six or fewer) input views, effectively outpainting the NeRF as well as inpainting newly disoccluded regions inside the original viewing volume. We compare with related work both quantitatively and qualitatively and show significant gains over prior art.

Paper Structure

This paper contains 32 sections, 4 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: BaseNeRF vs ExtraNeRF: We train a BaseNeRF model and our ExtraNeRF model on six input views and render the scene from extrapolated viewpoints. Using our visibility-aware, diffusion-guided inpainting and enhancement modules, we are able to synthesize sharp content in disoccluded regions, whereas the BaseNeRF suffers from blurry results (see the red boxes, green boxes, and the close-up insets).
  • Figure 2: Overview of our method: We start from $n$ input images, their camera poses, and depth maps (predicted as described in Sec. \ref{sec:method}). In Step 1, we train a BaseNeRF by supervising with this input data. In Step 2, we add supervision from virtual views. We repeatedly inpaint the areas that are unsupervised by the original input views with a diffusion model while continuing to supervise the NeRF with the virtual views. In Step 3, we iterate in similar fashion, but instead of inpainting we apply another diffusion model specifically designed to further improve the detail and color consistency in inpainted regions.
  • Figure 3: The input triplet of each diffusion model consists of a noisy image, a mask, and a guidance image. The masked pixels of the guidance image are erased for ${\Psi}^{\text{inpaint}}$, whereas they are preserved as guidance for ${\Psi}^{\text{enhance}}$.
  • Figure 4: Illustration of data collection for the enhancement model. We draw a pseudo visibility mask in a captured photo. Ground-truth supervision in the mask is replaced by inpainting supervision when we iteratively optimize the NeRF. The optimization corrupts pixels in the mask when rendered with the NeRF. A captured photo along with several corrupted images from different optimization iterations can be used to train ${\Psi}^{\text{enhance}}$.
  • Figure 5: The depth completion model takes a masked depth along with a guidance image as input and completes the depth in the masked region using the guidance of the RGB image.
  • ...and 3 more figures
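
The input triplet described in Figure 3 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `build_triplet`, the `mode` and `noise_level` parameters, the choice of zero as the "erased" value, and the plain additive Gaussian noise (a stand-in for a real diffusion noise schedule) are all assumptions made for illustration. The only behavior taken from the caption is that the masked guidance pixels are erased for the inpainting model and preserved for the enhancement model.

```python
import numpy as np


def build_triplet(image, mask, mode, noise_level=1.0, rng=None):
    """Assemble a (noisy image, mask, guidance image) triplet as in Figure 3.

    image: float array of shape (H, W, 3)
    mask:  bool array of shape (H, W), True for pixels to inpaint or enhance
    mode:  "inpaint" erases masked guidance pixels, "enhance" preserves them
    """
    if mode not in ("inpaint", "enhance"):
        raise ValueError("mode must be 'inpaint' or 'enhance'")
    rng = np.random.default_rng(0) if rng is None else rng

    # Assumption: simple additive Gaussian noise instead of a real scheduler.
    noisy = image + noise_level * rng.standard_normal(image.shape)

    guidance = image.copy()
    if mode == "inpaint":
        # Assumption: "erased" is modeled as setting masked pixels to zero.
        guidance[mask] = 0.0
    # In "enhance" mode the masked pixels stay in the guidance image.

    return noisy, mask.astype(np.float32), guidance


# Hypothetical usage on a tiny constant image with a 2x2 masked square.
image = np.full((4, 4, 3), 0.5, dtype=np.float32)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

_, mask_in, guide_inpaint = build_triplet(image, mask, "inpaint")
_, _, guide_enhance = build_triplet(image, mask, "enhance")
```

In this sketch the two models differ only in how the guidance image is prepared, which mirrors the caption: the inpainting model sees nothing inside the mask, while the enhancement model sees the current (possibly blurry) content there and can refine it.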