Single Image, Any Face: Generalisable 3D Face Generation

Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu

TL;DR

Gen3D-Face generates photorealistic 3D face avatars from a single unconstrained image and generalises well to out-of-domain imagery. It introduces a diffusion-based pipeline that produces consistent multi-view images from one input and reconstructs 3D geometry by conditioning on a mesh estimated from that input, with training augmented by synthetic data from PanoHead. A joint multi-view generation strategy and careful data curation preserve appearance and identity across views, yielding state-of-the-art results in out-of-domain tests and competitive results in-domain. The work offers a practical pathway to deployable single-image avatars for applications in AR/VR, video conferencing, and virtual modeling.

Abstract

The creation of 3D human face avatars from a single unconstrained image is a fundamental task that underlies numerous real-world vision and graphics applications. Despite the significant progress made in generative models, existing methods are either not designed for human faces or fail to generalise from the restrictive training domain to unconstrained facial images. To address these limitations, we propose a novel model, Gen3D-Face, which generates 3D human faces from a single unconstrained input image within a multi-view consistent diffusion framework. Given a specific input image, our model first produces multi-view images, followed by neural surface construction. To incorporate face geometry information in a generalisable manner, we utilise input-conditioned mesh estimation instead of ground-truth meshes, along with synthetic multi-view training data. Importantly, we introduce a multi-view joint generation scheme to enhance appearance consistency among different views. To the best of our knowledge, this is the first attempt and benchmark for creating photorealistic 3D human face avatars from single images for generic human subjects across domains. Extensive experiments demonstrate the superiority of our method over previous alternatives for out-of-domain single-image 3D face generation and competitive performance in the in-domain setting.

Paper Structure

This paper contains 14 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: 3D human face avatar from (a) a single unconstrained image by (c) a prior state-of-the-art model chen2024morphable (note the hallucinated hat and clear identity shift), vs. (d) our model.
  • Figure 2: An overview of our Gen3D-Face. It adopts the latent diffusion paradigm, which involves learning multi-step denoising. Each step denoises $N$ novel views conditioned on a single face image $y$ and the mesh $\mathbb{M}$ estimated from $y$, following the process outlined below: (a) A light CNN encoder integrates the noisy multi-view images $\mathbf{x}^{(1:N)}_t$ generated in the previous steps with the camera angles and time embedding; (b) Its output is interpolated with a predefined 3D voxel to obtain the appearance feature volume $F_{a}$; (c) Combining $F_{a}$ with the geometry prior $\mathbb{M}$ yields the hybrid feature volume $F_{ag}$; (d) Finally, the denoised views $\mathbf{x}^{(1:N)}_{t-1}$ are obtained by injecting $F_{ag}$ into FrustumTV3DNet to obtain the view frustum volume $F_{vf}$, which is fed into the diffusion backbone as the conditioning signal.
  • Figure 3: Examples of synthetic face images.
  • Figure 4: Examples of novel view generation on FFHQ (out-of-domain setting). The test views come from the Facescape yang2020facescape testing views, except for Era3D.
  • Figure 5: Examples of novel view generation for H3DS (out-of-domain setting). The test views are uniformly sampled across $360^\circ$.
  • ...and 4 more figures
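
Steps (b) and (c) of the Figure 2 caption can be illustrated with a toy NumPy sketch. It is not the authors' implementation. It assumes a simple pinhole camera orbiting the vertical axis, nearest-neighbour lookup in place of interpolation, plain averaging across views, channel-wise concatenation as the way $F_{a}$ and the geometry prior are combined, and a sphere occupancy grid standing in for the voxelised estimated mesh. All function names and parameters (`voxel_centres`, `orbit_rotation`, `appearance_volume`, `hybrid_volume`, `radius`, `focal`) are hypothetical.

```python
import numpy as np


def voxel_centres(res, extent=0.5):
    """Return (res**3, 3) voxel-centre coordinates in [-extent, extent]^3."""
    axis = np.linspace(-extent, extent, res)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)


def orbit_rotation(azimuth):
    """Rotation about the vertical (y) axis by `azimuth` radians."""
    c, s = np.cos(azimuth), np.sin(azimuth)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def appearance_volume(feats, azimuths, res, radius=2.0, focal=1.5):
    """Average per-view features sampled at each voxel's projection.

    feats: (N, C, H, W) feature maps, one per view (assumed layout).
    Returns a (C, res, res, res) volume, a stand-in for F_a.
    """
    n, c, h, w = feats.shape
    pts = voxel_centres(res)                       # (V, 3)
    acc = np.zeros((c, pts.shape[0]))
    for view, az in zip(feats, azimuths):
        cam = pts @ orbit_rotation(az).T           # rotate into camera frame
        cam[:, 2] += radius                        # place grid in front of camera
        u = focal * cam[:, 0] / cam[:, 2]          # normalised image coords
        v = focal * cam[:, 1] / cam[:, 2]
        # Nearest-neighbour pixel lookup (assumption; clipped at the border).
        col = np.clip(np.rint((u + 1) / 2 * (w - 1)), 0, w - 1).astype(int)
        row = np.clip(np.rint((v + 1) / 2 * (h - 1)), 0, h - 1).astype(int)
        acc += view[:, row, col]                   # (C, V)
    return (acc / n).reshape(c, res, res, res)


def hybrid_volume(f_a, geometry):
    """Concatenate appearance and geometry volumes along the channel axis."""
    return np.concatenate([f_a, geometry], axis=0)


rng = np.random.default_rng(0)
N, C, H, W, RES = 4, 8, 16, 16, 8
feats = rng.standard_normal((N, C, H, W))
# Views spread uniformly over 360 degrees around the head.
azimuths = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
f_a = appearance_volume(feats, azimuths, RES)

# Toy geometry prior: occupancy of a sphere instead of a real face mesh.
grid = voxel_centres(RES).reshape(RES, RES, RES, 3)
occupancy = (np.linalg.norm(grid, axis=-1) < 0.4).astype(float)[None]
f_ag = hybrid_volume(f_a, occupancy)
print(f_a.shape, f_ag.shape)
```

Averaging across views is only one possible aggregation choice; it keeps the sketch short and makes the volume independent of view order.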