Dual Encoder GAN Inversion for High-Fidelity 3D Head Reconstruction from Single Images

Bahri Batuhan Bilecen, Ahmet Berke Gokmen, Aysegul Dundar

TL;DR

This work addresses 360-degree 3D head reconstruction from a single image by inverting into the PanoHead latent space, overcoming EG3D limitations with a dual-encoder framework and an occlusion-aware triplane discriminator. A stitching mechanism in the triplane domain merges outputs from two specialized encoders to achieve both high-fidelity input reconstructions and realistic unseen-view generations, enabling consistent 360-degree renders. Quantitatively, the method outperforms state-of-the-art encoders and optimization-based approaches across $L_2$, LPIPS, ID, and FID on FFHQ+LPFF and MEAD datasets, with qualitative gains in hair realism and pose variability; an editing workflow is demonstrated in the triplane space. These advances promise improved 3D-aware face synthesis for AR/VR and film, while underscoring ethical considerations around deepfakes and the need for safeguards. The approach leverages latent spaces such as $\mathcal{W}^+$ and introduces an occlusion-aware discriminator $\mathcal{D}$ to manage visible and occluded regions throughout 360-degree rendering.

Abstract

3D GAN inversion aims to project a single image into the latent space of a 3D Generative Adversarial Network (GAN), thereby achieving 3D geometry reconstruction. While there exist encoders that achieve good results in 3D GAN inversion, they are predominantly built on EG3D, which specializes in synthesizing near-frontal views and is limiting in synthesizing comprehensive 3D scenes from diverse viewpoints. In contrast to existing approaches, we propose a novel framework built on PanoHead, which excels in synthesizing images from a 360-degree perspective. To achieve realistic 3D modeling of the input image, we introduce a dual encoder system tailored for high-fidelity reconstruction and realistic generation from different viewpoints. Accompanying this, we propose a stitching framework on the triplane domain to get the best predictions from both. To achieve seamless stitching, both encoders must output consistent results despite being specialized for different tasks. For this reason, we carefully train these encoders using specialized losses, including an adversarial loss based on our novel occlusion-aware triplane discriminator. Experiments reveal that our approach surpasses the existing encoder training methods qualitatively and quantitatively. Please visit the project page: https://berkegokmen1.github.io/dual-enc-3d-gan-inv.
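
The stitching idea in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that each encoder yields a triplane of shape `(3, C, H, W)` and that a soft visibility mask in `[0, 1]` is available for the triplane features; the function name `stitch_triplanes`, the shapes, and the linear blend are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def stitch_triplanes(t_fidelity, t_generative, visibility):
    """Blend two triplanes with a soft visibility mask (illustrative only).

    t_fidelity:   triplane from the encoder tuned for input reconstruction
    t_generative: triplane from the encoder tuned for unseen views
    visibility:   mask in [0, 1], broadcastable to the triplane shape;
                  1 marks features visible from the input pose
    """
    visibility = np.clip(visibility, 0.0, 1.0)
    # Visible regions come from the fidelity branch, occluded regions
    # from the generative branch.
    return visibility * t_fidelity + (1.0 - visibility) * t_generative


# Toy data: 3 planes, 4 channels, 8x8 resolution (all values hypothetical).
rng = np.random.default_rng(0)
t_fid = rng.normal(size=(3, 4, 8, 8))
t_gen = rng.normal(size=(3, 4, 8, 8))

# Assume the left half of every plane is visible from the input view.
vis = np.zeros((3, 1, 8, 8))
vis[..., :4] = 1.0

stitched = stitch_triplanes(t_fid, t_gen, vis)
```

A blend of this kind only looks seamless if the two triplanes agree near the mask boundary, which is why the abstract stresses that both encoders must produce consistent outputs.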

Paper Structure (20 sections, 9 equations, 24 figures, 6 tables)

This paper contains 20 sections, 9 equations, 24 figures, 6 tables.

Figures (24)

  • Figure 1: From a single input image (first column), our framework reconstructs 3D representation by inverting images into PanoHead's latent space, which can be viewed in a 360-degree perspective.
  • Figure 2: Overall architecture of PanoHead.
  • Figure 3: Our training methodology for the triplane discriminator involves generating real samples by sampling latent vectors $\mathcal{Z}^+$ and producing in-domain triplanes using PanoHead. Fake samples are generated from encoded images. Despite the effectiveness of adversarial loss in enhancing reconstructions, challenges may persist in achieving high fidelity to the input due to the origin of real samples from the generator $G$. To address this, we propose an occlusion-aware discriminator $\mathcal{D}$, trained exclusively with features from occluded pixels. This ensures that visible regions, such as frontal views $\pi_R$, have reduced influence during the training of $\mathcal{D}$.
  • Figure 4: The inference pipeline with dual encoders for full 3D head reconstruction. Given a face portrait with pose $\pi_R$, we can perform 360-degree rendering from any given pose $\pi_\text{novel}$.
  • Figure 5: Visual results of Encoder 1, Encoder 2, and Dual encoders for the given input images in the first and sixth columns.
  • ...and 19 more figures
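
The occlusion-aware training described in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions: the helper names `keep_occluded` and `discriminator_loss`, the `(3, C, H, W)` triplane shape, the availability of a visibility mask, and the non-saturating logistic loss are all hypothetical choices that this summary does not specify.

```python
import numpy as np


def keep_occluded(triplane, visibility):
    """Zero out visible features so only occluded ones reach the discriminator.

    visibility is a mask in [0, 1], broadcastable to the triplane shape,
    where 1 marks features visible from the input pose (assumed given).
    """
    occluded = 1.0 - np.clip(visibility, 0.0, 1.0)
    return triplane * occluded


def discriminator_loss(real_logits, fake_logits):
    """Non-saturating logistic loss: softplus(-real) + softplus(fake).

    The loss form is an assumption made for this sketch.
    """
    real_term = np.mean(np.logaddexp(0.0, -real_logits))
    fake_term = np.mean(np.logaddexp(0.0, fake_logits))
    return float(real_term + fake_term)


# Toy triplane with the left half of each plane marked visible.
features = np.ones((3, 4, 8, 8))
vis = np.zeros((3, 1, 8, 8))
vis[..., :4] = 1.0

masked = keep_occluded(features, vis)

# With uninformative logits the loss equals 2 * ln(2).
loss_at_zero = discriminator_loss(np.zeros(5), np.zeros(5))
```

In this reading, the same mask would be applied to both the real triplanes (from sampled latents) and the fake triplanes (from encoded images), so the discriminator never compares features from regions that the input view already constrains.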