NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion

Jiatao Gu, Alex Trevithick, Kai-En Lin, Josh Susskind, Christian Theobalt, Lingjie Liu, Ravi Ramamoorthi

TL;DR

NerfDiff tackles single-image novel view synthesis by combining a camera-space NeRF with a 3D-aware conditional diffusion model (CDM). The two components are trained jointly; at test time, NeRF-guided distillation (NGD) generates and refines a set of virtual views, enforcing 3D consistency by guiding the diffusion process with the NeRF. The NeRF is conditioned on a local, camera-aligned triplane representation for efficient rendering, and the CDM refines its renderings to recover occluded details. NGD alternates NeRF updates with diffusion steps so that the NeRF agrees with the multi-view denoised targets. Experiments on ShapeNet, ABO, and Clevr3D show state-of-the-art quantitative and qualitative results, with notably sharper renderings behind occlusions and improved multi-view consistency.

Abstract

Novel view synthesis from a single image requires inferring occluded regions of objects and scenes whilst simultaneously maintaining semantic and physical consistency with the input. Existing approaches condition neural radiance fields (NeRF) on local image features, projecting points to the input image plane, and aggregating 2D features to perform volume rendering. However, under severe occlusion, this projection fails to resolve uncertainty, resulting in blurry renderings that lack details. In this work, we propose NerfDiff, which addresses this issue by distilling the knowledge of a 3D-aware conditional diffusion model (CDM) into NeRF through synthesizing and refining a set of virtual views at test time. We further propose a novel NeRF-guided distillation algorithm that simultaneously generates 3D consistent virtual views from the CDM samples, and finetunes the NeRF based on the improved virtual views. Our approach significantly outperforms existing NeRF-based and geometry-free approaches on challenging datasets, including ShapeNet, ABO, and Clevr3D.
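The alternation between diffusion steps and NeRF updates can be illustrated with a small NumPy toy. This is only a sketch under strong assumptions and not the paper's algorithm: the "NeRF" is a single shared vector whose view k is a cyclic shift (`render_views`), the frozen CDM is replaced by the closed-form posterior mean under a Gaussian prior (`toy_denoise`), and the guidance blend `gamma`, the inner gradient steps, and the noise schedule are all hypothetical choices made for illustration.

```python
import numpy as np

def render_views(theta, n_views):
    # Hypothetical "NeRF": one shared scene vector; view k is a cyclic shift of it.
    return np.stack([np.roll(theta, k) for k in range(n_views)])

def toy_denoise(x_t, abar_t, scene, prior_std):
    # Stand-in for the frozen CDM (assumption): posterior mean E[x0 | x_t] under
    # a Gaussian prior N(view_k(scene), prior_std^2 I). A real CDM is a network.
    mu = render_views(scene, x_t.shape[0])
    a = np.sqrt(abar_t)
    gain = a * prior_std**2 / (abar_t * prior_std**2 + 1.0 - abar_t)
    return mu + gain * (x_t - a * mu)

def ngd_toy(scene, theta0, n_views=4, n_steps=20, gamma=0.5,
            inner_steps=3, lr=0.25, prior_std=0.5, seed=0):
    rng = np.random.default_rng(seed)
    abar = np.linspace(0.99, 0.01, n_steps)      # abar[0] is the least noisy level
    theta = theta0.copy()
    x = rng.normal(size=(n_views, scene.size))   # start every virtual view from noise
    for t in reversed(range(n_steps)):
        abar_t = abar[t]
        abar_prev = abar[t - 1] if t > 0 else 1.0
        # 1) Denoise each virtual view independently.
        x0_hat = toy_denoise(x, abar_t, scene, prior_std)
        # 2) Guidance (assumed form): pull the estimate toward the shared rendering.
        x0_guided = x0_hat + gamma * (render_views(theta, n_views) - x0_hat)
        # 3) A few gradient steps fitting the shared scene to the guided targets.
        for _ in range(inner_steps):
            resid = render_views(theta, n_views) - x0_guided
            grad = 2.0 * np.mean([np.roll(r, -k) for k, r in enumerate(resid)], axis=0)
            theta = theta - lr * grad
        # 4) Deterministic DDIM-style step to the next (less noisy) level.
        eps_hat = (x - np.sqrt(abar_t) * x0_guided) / np.sqrt(1.0 - abar_t)
        x = np.sqrt(abar_prev) * x0_guided + np.sqrt(1.0 - abar_prev) * eps_hat
    return theta, x

def view_spread(x):
    # Mean std across views after undoing each view's shift (0 = fully consistent).
    aligned = np.stack([np.roll(v, -k) for k, v in enumerate(x)])
    return aligned.std(axis=0).mean()

scene = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))
theta0 = 0.5 * scene  # hypothetical blurry initial estimate
_, x_guided = ngd_toy(scene, theta0, gamma=0.5)
_, x_plain = ngd_toy(scene, theta0, gamma=0.0)   # same noise, no guidance
print(view_spread(x_guided), view_spread(x_plain))
```

In this toy, setting `gamma=0.0` reduces the loop to independent per-view sampling, so comparing the two spreads shows how coupling the views through a shared scene representation reduces cross-view disagreement.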

Paper Structure (51 sections, 8 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: Renderings from our method in comparison to the SoTA VisionNeRF. Note how our method can predict sharp renderings despite large occlusion, whereas VisionNeRF cannot handle this uncertainty and shows implausible blurring.
  • Figure 2: NerfDiff incorporates a training and finetuning pipeline. We first learn the single-image NeRF and the 2D CDM, with the CDM conditioned on the single-image NeRF renderings (left). At test time, we use the learned network parameters to predict an initial NeRF representation for finetuning. The NeRF-guided denoised images from the frozen CDM then supervise the NeRF in turn (right).
  • Figure 3: Details of the architecture of the single-image NeRF for NerfDiff. Using a UNet, we first map an input image to a camera-aligned triplane-based NeRF representation. This triplane efficiently conditions volume rendering from a targeted view, resulting in an initial rendering. This rendering conditions the diffusion process so the CDM can consistently denoise at that target pose.
  • Figure 4: A qualitative comparison of our approach versus baselines in single-image view synthesis on multiple datasets. Compared to 3D methods like VisionNeRF and Ours (w/o NGD), our proposed NerfDiff synthesizes significantly sharper results behind occlusions. Compared to Ours (CDM), our full model showcases its built-in multi-view consistency. The red arrows display the CDM's inability to synthesize consistently across views.
  • Figure 5: A qualitative comparison on Clevr3D, which consists of images from cameras rotated 120 degrees about the z-axis. This figure showcases generalization to OOD cameras. VisionNeRF produces a degenerate result, while NerfDiff provides sharper renderings with fewer artifacts.
  • ...and 5 more figures
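
To make the Figure 3 caption more concrete, the following NumPy sketch shows how a triplane can condition volume rendering along one ray. It is a simplified illustration under stated assumptions, not the paper's architecture: the planes are random arrays rather than UNet outputs, features from the three planes are summed, the decoder mapping features to density and color is a fixed toy function, and sample points are clipped to the unit cube. All function names (`sample_plane`, `triplane_features`, `composite`, `render_ray`) are hypothetical. Only `composite` follows a standard formula, the usual NeRF alpha-compositing quadrature.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at coords u, v in [-1, 1]."""
    H, W, _ = plane.shape
    x = (u + 1.0) * 0.5 * (W - 1)          # continuous column index
    y = (v + 1.0) * 0.5 * (H - 1)          # continuous row index
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    fx = np.clip(x - x0, 0.0, 1.0)[:, None]
    fy = np.clip(y - y0, 0.0, 1.0)[:, None]
    top = plane[y0, x0] * (1 - fx) + plane[y0, x0 + 1] * fx
    bot = plane[y0 + 1, x0] * (1 - fx) + plane[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

def triplane_features(planes, pts):
    """Sum features from the xy, xz, yz planes for points (N, 3) in [-1, 1]^3."""
    xy, xz, yz = planes
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    return sample_plane(xy, x, y) + sample_plane(xz, x, z) + sample_plane(yz, y, z)

def composite(sigma, rgb, deltas):
    """Standard NeRF quadrature: alpha-composite the samples along one ray."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights

def render_ray(planes, origin, direction, near=0.0, far=2.0, n_samples=32):
    t = np.linspace(near, far, n_samples)
    pts = origin[None, :] + t[:, None] * direction[None, :]
    feat = triplane_features(planes, np.clip(pts, -1.0, 1.0))
    # Toy decoder (assumption): channel 0 -> density, channels 1..3 -> color.
    sigma = np.log1p(np.exp(feat[:, 0]))            # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-feat[:, 1:4]))       # sigmoid keeps color in (0, 1)
    deltas = np.full(n_samples, (far - near) / (n_samples - 1))
    return composite(sigma, rgb, deltas)

rng = np.random.default_rng(0)
planes = [rng.normal(size=(8, 8, 4)) for _ in range(3)]
color, weights = render_ray(planes, np.array([0.0, 0.0, -1.0]),
                            np.array([0.0, 0.0, 1.0]))
print(color, weights.sum())
```

Because each 3D sample only needs three 2D lookups and a small decoder, this kind of representation keeps per-ray cost low, which is the efficiency argument the caption alludes to.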