High-Quality 3D Head Reconstruction from Any Single Portrait Image

Jianfu Zhang, Yujie Gao, Jiahui Zhan, Wentao Wang, Yiyi Zhang, Haohua Zhao, Liqing Zhang

TL;DR

This work tackles high-fidelity 3D head reconstruction from a single portrait by introducing an identity-guided two-stage diffusion pipeline (ID-TS) and a large 3D digital human head dataset. The method generates consistent multi-view images from a single input, then reconstructs a detailed 3D Gaussian avatar, guided by identity and expression information through ArcFace embeddings and a dual-stage denoising strategy. Key contributions include the ID-TS diffusion model, an enhanced Gaussian avatar reconstruction framework, and a comprehensive dataset of 227 portrait sequences across 96 viewpoints. The approach achieves state-of-the-art fidelity on single-image-to-3D head tasks and enables practical applications such as frontal-view prediction, stylized head generation, and facial animation, with potential impact on gaming, AR/VR, and telepresence.

Abstract

In this work, we introduce a novel high-fidelity 3D head reconstruction method from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts in adapting 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits these approaches in generating realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames, featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision: accurate facial representations are extracted and used both to condition the model and to define objective functions that enforce identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories.
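The dataset figures in the abstract can be checked with a line of arithmetic, and the 96-frame orbit can be pictured as cameras on a circle around the head. The sketch below is illustrative only: the sequence and view counts come from the abstract, while the even azimuth spacing, fixed elevation, unit radius, and the helper names `total_frames` and `orbit_positions` are assumptions, not details given in the source.

```python
import math

NUM_SEQUENCES = 227  # stated in the abstract
NUM_VIEWS = 96       # stated in the abstract


def total_frames(num_sequences, num_views):
    """One frame per (sequence, viewpoint) pair."""
    return num_sequences * num_views


def orbit_positions(num_views, radius=1.0, elevation_deg=0.0):
    """Hypothetical orbit: cameras evenly spaced in azimuth around the head.

    Even spacing, a single elevation, and the radius are assumptions made
    for illustration; the source only says the orbit has 96 frames.
    """
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        azim = 2.0 * math.pi * i / num_views
        x = radius * math.cos(elev) * math.sin(azim)
        y = radius * math.sin(elev)
        z = radius * math.cos(elev) * math.cos(azim)
        positions.append((x, y, z))
    return positions


frames = total_frames(NUM_SEQUENCES, NUM_VIEWS)
cameras = orbit_positions(NUM_VIEWS)
print(frames)        # 21792, matching the abstract
print(len(cameras))  # 96
```

Under these assumptions every camera sits at the same distance from the head, so consecutive frames differ only by a rotation of 360/96 = 3.75 degrees.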

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Given a single head image, our method generates multi-view images with rich facial details. A high-fidelity 3D Gaussian head of the identity is then reconstructed from the multi-view images. Our method also generalizes well to multiple downstream tasks, such as frontal face prediction, stylized head generation, and facial expression animation.
  • Figure 2: Overview of our inference pipeline. Phase 1: In the low-resolution stage, we embed camera pose e and noise step t via positional encoding, concatenate them, and feed them into the U-Net’s residual blocks. The conditional image x is encoded into the latent space by the VAE encoder E, concatenated with noise, and processed alongside CLIP and ArcFace embeddings via cross-attention in the transformer block, generating multi-view images with accurate head shapes and coarse facial texture. In the high-resolution stage, we upsample the previous outputs, add latent noise element-wise, and denoise the result, producing multi-view images with high-fidelity texture details. Phase 2: With front/left/back/right images as inputs and the remaining frames as supervision for 3D U-Net optimization, we finally yield a high-quality Gaussian head P.
  • Figure 3: Different expressions and an accessory in 3D digital human models. Diverse facial expressions are sampled to enrich our dataset, along with an accessory to validate the fitting capability of our model.
  • Figure 4: Qualitative comparison. Visualization results demonstrate that our method significantly outperforms existing approaches in capturing fine-grained details, including head geometry, facial feature consistency, expression texture, and gaze direction.
  • Figure 5: Applications for downstream tasks. We evaluate our method on three downstream tasks: frontal view prediction from profile views, stylized head generation, and facial animation, demonstrating its versatility across diverse application scenarios.
  • ...and 1 more figure
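Two steps from the Figure 2 caption can be sketched in a few lines: embedding the camera pose e and noise step t with positional encoding before concatenating them, and splitting the 96 orbit frames into four input views and the remaining supervision views. This is a rough sketch under stated assumptions, not the authors' implementation: the transformer-style sinusoidal encoding, the three-component pose, the embedding width, the frame ordering (frame 0 frontal, then left, back, right at quarter turns), and all function names are hypothetical.

```python
import math


def sinusoidal_encoding(value, dim=8):
    """Transformer-style sinusoidal encoding of one scalar (assumed form)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    out = []
    for i in range(half):
        freq = 1.0 / (10000.0 ** (i / half))
        out.append(math.sin(value * freq))
        out.append(math.cos(value * freq))
    return out


def conditioning_vector(pose, t, dim=8):
    """Encode each pose component and the noise step, then concatenate.

    `pose` is a hypothetical (elevation, azimuth, radius) triple; the
    source does not specify how the camera pose e is parameterized.
    """
    enc = []
    for component in pose:
        enc.extend(sinusoidal_encoding(component, dim))
    enc.extend(sinusoidal_encoding(float(t), dim))
    return enc


def split_views(num_views=96):
    """Pick four views a quarter turn apart as inputs; the rest supervise.

    Assumes frame 0 is frontal and frames are evenly spaced along the orbit.
    """
    step = num_views // 4
    inputs = [0, step, 2 * step, 3 * step]
    supervision = [i for i in range(num_views) if i not in inputs]
    return inputs, supervision


vec = conditioning_vector((0.0, math.pi / 2, 1.5), t=10)
inputs, supervision = split_views()
print(len(vec))                       # 32 = (3 pose components + 1 step) * 8
print(inputs, len(supervision))       # [0, 24, 48, 72] 92
```

In the pipeline described by the caption, a vector like `vec` would be fed to the U-Net's residual blocks, while the 92 supervision frames would serve as targets when optimizing the Gaussian head from the four input views.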