G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images

Zixiong Huang, Qi Chen, Libo Sun, Yifan Yang, Naizhou Wang, Mingkui Tan, Qi Wu

TL;DR

G-NeRF introduces a geometry-driven pipeline for single-shot novel view synthesis. It draws geometry priors from an off-the-shelf 3D GAN (EG3D) through Geometry-guided Multi-View Synthesis (GMVS) and enforces Depth-aware Training (DaT) via a depth-aware discriminator. A truncation-based sampling strategy balances identity diversity against geometric fidelity in the synthetic multi-view data, while a depth-guided adversarial objective provides depth-consistent supervision for real-world single-view inputs. Experiments on FFHQ, AFHQv2-Cats, and CelebAMask-HQ demonstrate improved depth accuracy, identity preservation, and view consistency over single-view baselines such as Pix2NeRF, with favorable inference speed compared to GAN-inversion methods. The approach produces high-fidelity, 3D-consistent renderings without test-time optimization and highlights the practical potential of combining 3D GAN priors with NeRF-based rendering for scalable single-view 3D synthesis.

Abstract

Novel view synthesis aims to generate images from new viewpoints given a collection of images from known views. Recent attempts address this problem by relying on 3D geometry priors (e.g., shapes, sizes, and positions) learned from multi-view images. However, such methods encounter the following limitations: 1) they require a set of multi-view images as training data for a specific scene (e.g., face, car, or chair), which is often unavailable in real-world scenarios; 2) they fail to extract the geometry priors from single-view images due to the lack of multi-view supervision. In this paper, we propose a Geometry-enhanced NeRF (G-NeRF), which seeks to enhance the geometry priors by a geometry-guided multi-view synthesis approach, followed by depth-aware training. In the synthesis process, inspired by the observation that existing 3D GAN models can unconditionally synthesize high-fidelity multi-view images, we adopt off-the-shelf 3D GAN models, such as EG3D, as a free source of geometry priors through synthesizing multi-view data. Simultaneously, to further improve the geometry quality of the synthetic data, we introduce a truncation method to effectively sample latent codes within 3D GAN models. To tackle the absence of multi-view supervision for single-view images, we design the depth-aware training approach, incorporating a depth-aware discriminator to guide geometry priors through depth maps. Experiments demonstrate the effectiveness of our method in terms of both qualitative and quantitative results.

Paper Structure (38 sections, 8 equations, 16 figures, 5 tables)

This paper contains 38 sections, 8 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Comparison of different methods. To achieve single-shot novel view synthesis, previous methods either (a) require real-world multi-view images to establish geometry priors or (b) need additional optimization for a specific image. (c) In contrast, our method captures the geometry priors from an existing 3D GAN trained on single-view images only.
  • Figure 2: Overall scheme of G-NeRF. Given a latent code $\mathbf{w}$ randomly sampled in $\mathcal{W}$ space, we first apply a truncation method to obtain $\mathbf{w}^{\prime}$, bringing it closer to the center of mass of $\mathcal{W}$ space, represented by $\bar{\mathbf{w}}$. After that, in conjunction with a set of camera poses $\{\mathbf{P}_{f}, \mathbf{P}_{s}, \mathbf{P}_{d}\}$, we generate a triplet of synthetic data $\{{I}_{f}, {I}_{s}, \mathbf{D}_{syn}\}$. To capture geometry priors from synthetic multi-view images, we synthesize a novel view $\hat{I}_{s}$ using ${I}_{f}$ as the reference image and enforce it to be consistent with ${I}_{s}$. Additionally, we employ a self-reconstruction task with the real-world image ${I}_{r}$. Moreover, we design a depth-aware discriminator $\mathcal{D}_{g}$ to further enhance the depth quality of the generated scenes.
  • Figure 3: Illustration of the trade-off between identity diversity and geometry quality of the generated samples. Samples are generated by EG3D [chan2022efficient] with the same set of latent codes and different truncation ratios $\psi$. As $\psi$ rises, the identity diversity (e.g., hair color, skin color, and glasses) of the generated samples also increases. In contrast, the geometry quality of these scenes gradually degrades.
  • Figure 4: Qualitative comparison. Compared to Pix2NeRF [cai2022pix2nerf], our G-NeRF demonstrates the capability to generate novel views that closely resemble the reference images with higher clarity (comparison at $512^2$ resolution).
  • Figure 5: Qualitative comparisons with PTI [roich2021pivotal].
  • ...and 11 more figures
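
The truncation step in the Figure 2 and Figure 3 captions moves a sampled latent code $\mathbf{w}$ toward the mean latent $\bar{\mathbf{w}}$, with the ratio $\psi$ controlling how far. The excerpt here does not spell out the exact formula, so the sketch below assumes the standard StyleGAN-style linear interpolation $\mathbf{w}^{\prime} = \bar{\mathbf{w}} + \psi(\mathbf{w} - \bar{\mathbf{w}})$. The function name `truncate_latents`, the 512-dimensional latent size, and the random stand-ins for the latent codes are illustrative assumptions, not the authors' code or the EG3D API.

```python
import numpy as np


def truncate_latents(w, w_avg, psi):
    """Pull latent codes toward the mean latent.

    Assumed form (StyleGAN-style truncation): w' = w_avg + psi * (w - w_avg).
    psi = 0 collapses every code onto w_avg; psi = 1 leaves the codes unchanged.
    """
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must lie in [0, 1]")
    return w_avg + psi * (w - w_avg)


# Hypothetical stand-ins: a real pipeline would take w from the 3D GAN's
# mapping network and w_avg from a running average over many sampled codes.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 512))
w_avg = rng.normal(size=(10_000, 512)).mean(axis=0)

for psi in (0.3, 0.7, 1.0):
    w_trunc = truncate_latents(w, w_avg, psi)
    spread = np.linalg.norm(w_trunc - w_avg, axis=1).mean()
    print(f"psi={psi:.1f}  mean distance to w_avg={spread:.2f}")
```

Under this assumed form, the distance of each truncated code from $\bar{\mathbf{w}}$ scales linearly with $\psi$. That matches the trade-off described for Figure 3: a larger $\psi$ keeps more identity diversity, while a smaller $\psi$ stays closer to the region where the caption reports better geometry.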