High-Quality 3D Head Reconstruction from Any Single Portrait Image
Jianfu Zhang, Yujie Gao, Jiahui Zhan, Wentao Wang, Yiyi Zhang, Haohua Zhao, Liqing Zhang
TL;DR
This work tackles high-fidelity 3D head reconstruction from a single portrait by introducing an identity-guided two-stage diffusion pipeline (ID-TS) and a large 3D digital human head dataset. The method generates consistent multi-view images from a single input, then reconstructs a detailed 3D Gaussian avatar, guided by identity and expression information through ArcFace embeddings and a dual-stage denoising strategy. Key contributions include the ID-TS diffusion model, an enhanced Gaussian avatar reconstruction framework, and a comprehensive dataset of 227 portrait sequences across 96 viewpoints. The approach achieves state-of-the-art fidelity on single-image-to-3D head tasks and enables practical applications such as frontal-view prediction, stylized head generation, and facial animation, with potential impact on gaming, AR/VR, and telepresence.
Abstract
In this work, we introduce a novel method for high-fidelity 3D head reconstruction from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts to adapt 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits the ability of these approaches to generate realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits, each captured from 96 different perspectives, for a total of 21,792 frames featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision: we extract accurate facial representations and use them both to guide the model and in objective functions that enforce high identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories.
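The abstract does not specify how identity consistency is measured. As an illustration only, the sketch below shows one common way such a term could be formed: the mean cosine distance between a face embedding of the input portrait and the embeddings of the generated views. The function names, the use of cosine distance, and the toy 2-D vectors are assumptions made for this example, not the authors' formulation; in practice the embeddings would come from a face recognition network.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_consistency_loss(
    reference: Sequence[float],
    views: Sequence[Sequence[float]],
) -> float:
    """Mean cosine distance (1 - similarity) between the reference
    embedding and each generated-view embedding.

    Hypothetical loss for illustration: 0 means every view matches the
    reference direction exactly, larger values mean more identity drift.
    """
    if not views:
        raise ValueError("at least one view embedding is required")
    distances = [1.0 - cosine_similarity(reference, v) for v in views]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real face embeddings.
    reference = [1.0, 0.0]
    generated_views = [[2.0, 0.0], [0.0, 3.0]]
    print(identity_consistency_loss(reference, generated_views))
```

A lower value indicates that the generated views stay closer to the input identity. Under the same assumptions, an analogous term could be computed on expression features.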
