GD$^2$-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang

TL;DR

GD$^2$-NeRF presents a coarse-to-fine framework for one-shot novel view synthesis that is inference-time finetuning-free. It combines a GAN-powered coarse stage, the One-stage Parallel Pipeline (OPP), which injects in-distribution details, with a diffusion-based fine stage (Diff3DE), which adds out-distribution details while preserving 3D consistency. OPP optimizes the GAN and OG-NeRF models in parallel (via DPS, CoRF, and DPF) to balance fidelity and sharpness, and Diff3DE propagates diffusion tokens across nearby keyframes for 3D-aware enhancement. Evaluations on ShapeNet and DTU show improved detail and 3D consistency without per-scene optimization, outperforming state-of-the-art baselines in several metrics while maintaining efficiency.

Abstract

In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task, which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer from blurriness because their encoder-only architecture relies heavily on the limited reference image. On the other hand, recent diffusion-based image-to-3D methods show vivid, plausible results by distilling pre-trained 2D diffusion models into a 3D representation, yet require tedious per-scene optimization. Targeting these issues, we propose GD$^2$-NeRF, a Generative Detail compensation framework via GAN and Diffusion that is inference-time finetuning-free and produces vivid, plausible details. Specifically, following a coarse-to-fine strategy, GD$^2$-NeRF is mainly composed of a One-stage Parallel Pipeline (OPP) and a 3D-consistent Detail Enhancer (Diff3DE). At the coarse stage, OPP efficiently inserts the GAN model into the existing OG-NeRF pipeline to primarily relieve the blurriness with in-distribution priors captured from the training dataset, achieving a good balance between sharpness (LPIPS, FID) and fidelity (PSNR, SSIM). Then, at the fine stage, Diff3DE further leverages pre-trained image diffusion models to complement rich out-distribution details while maintaining decent 3D consistency. Extensive experiments on both synthetic and real-world datasets show that GD$^2$-NeRF noticeably improves details without per-scene finetuning.

Paper Structure

This paper contains 25 sections, 13 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Given a single reference image, our method GD$^2$-NeRF synthesizes novel views with vivid, plausible details in an inference-time finetuning-free manner. It is a coarse-to-fine generative detail compensation framework composed of OPP and Diff3DE. OPP first injects the GAN model into existing OG-NeRF pipelines, e.g., PixelNeRF \cite{PixelNeRF_yu2021pixelnerf}, to obtain in-distribution detail priors. Then, Diff3DE further incorporates the out-distribution detail priors from the pre-trained diffusion models \cite{rombach2022LatentDiff, zhang2023addingControlNet}. We highly recommend that readers check our video demos for more intuitive comparisons.
  • Figure 2: Comparison between the existing encoder-only OG-NeRF and our generative detail compensation perspective (§ \ref{sec:intro}). OG-NeRF suffers from blurriness due to misleading projected features, while we propose to complement the object details via the prior learned by the generative model.
  • Figure 3: Overview of two basic tandem pipelines (§ \ref{sec:ttp}, § \ref{sec:otp}) and our proposed One-stage Parallel Pipeline (OPP, § \ref{sec:opp}) for integrating the GAN model into the OG-NeRF framework at the COARSE STAGE.
  • Figure 4: Overview of the COARSE-STAGE method OPP for including in-distribution details from the training data (§ \ref{sec:opp}). It is built on the one-stage tandem pipeline (the first row) and efficiently integrates the GAN and OG-NeRF models in a unified parallel framework with DPS, CoRF, and DPF.
  • Figure 5: Overview of the FINE-STAGE method Diff3DE for including out-distribution details from the pre-trained diffusion model \cite{rombach2022LatentDiff, zhang2023addingControlNet} (§ \ref{fine_stage_Diff3DE}). We first fix $N_k$ dense keyframes around the dome. Then, for each target view, we select $3$ neighboring keyframes based on cosine similarity. For each diffusion time step and attention block, the output tokens of the target view are the barycentric interpolation of the propagated tokens from the neighboring keyframes, using the correspondence calculated during DDIM inversion. Global 3D consistency is primarily achieved by the 3D-consistent constraint from OPP and further approximated by enforcing local consistency within each neighbor area.
  • ...and 5 more figures
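
The neighbor selection and token interpolation described in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. All function names are hypothetical, views are assumed to be represented by camera direction vectors on the dome, and the propagated tokens are assumed to be already aligned to the target view (the DDIM-inversion correspondence step is not modeled). The way the barycentric weights are obtained here (a least-squares solve followed by clipping and renormalization) is also an assumption, since the caption does not specify it.

```python
import numpy as np

def select_neighbor_keyframes(target_dir, keyframe_dirs, k=3):
    """Indices of the k keyframes most similar to the target (cosine similarity)."""
    t = target_dir / np.linalg.norm(target_dir)
    kf = keyframe_dirs / np.linalg.norm(keyframe_dirs, axis=1, keepdims=True)
    sims = kf @ t                      # cosine similarity per keyframe
    return np.argsort(-sims)[:k]       # largest similarity first

def barycentric_weights(target_dir, neighbor_dirs):
    """Weights w with sum(w) == 1 such that target is approx. sum_i w_i * neighbor_i.

    Assumption: weights come from a least-squares solve; negative weights
    (target outside the neighbor triangle) are clipped to zero.
    """
    w, *_ = np.linalg.lstsq(neighbor_dirs.T, target_dir, rcond=None)
    w = np.clip(w, 0.0, None)
    if w.sum() == 0.0:                 # degenerate case: fall back to uniform weights
        w = np.ones_like(w)
    return w / w.sum()

def interpolate_tokens(weights, neighbor_tokens):
    """Blend tokens of shape (k, num_tokens, dim) into (num_tokens, dim)."""
    return np.tensordot(weights, neighbor_tokens, axes=1)

# Toy usage: four keyframes, one target view, two tokens of dimension 4.
keyframe_dirs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [0.0, 0.0, 1.0],
                          [-1.0, 0.0, 0.0]])
target_dir = np.array([1.0, 1.0, 1.0])
idx = select_neighbor_keyframes(target_dir, keyframe_dirs)
weights = barycentric_weights(target_dir, keyframe_dirs[idx])
tokens = np.stack([np.full((2, 4), float(i)) for i in idx])  # dummy propagated tokens
target_tokens = interpolate_tokens(weights, tokens)
```

In the method as described, such a blend would be applied at every diffusion time step and attention block, so the target view only ever sees tokens derived from its neighboring keyframes.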