ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling
Francesca Babiloni, Alexandros Lattas, Jiankang Deng, Stefanos Zafeiriou
TL;DR
ID-to-3D addresses the challenge of producing high-fidelity, identity-consistent 3D head assets from casual 2D images. It applies Score Distillation Sampling (SDS) with task-specific 2D diffusion priors, conditioned on ArcFace identity embeddings and CLIP-based expression cues, and decouples geometry and texture in a two-stage, neural parametric head framework built on DMTet and a UV-map texture model. A key innovation is fine-tuning only about 0.2% of the parameters with LoRA adapters and integrating identity and expression conditioning through multimodal cross-attention, which enables up to 13 distinct expressions per subject while preserving identity and generalizing to unseen subjects. The results show state-of-the-art 3D head quality, relightable textures, and robust identity consistency across viewpoints and expressions without requiring large 3D scan datasets, making render-ready 3D assets practical for gaming and telepresence. The work also discusses ethical considerations and limitations around biases and computational demands.
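For readers unfamiliar with SDS, its standard gradient (as introduced in DreamFusion, not by this work) can be sketched as below. The notation is an assumption for illustration: \theta denotes the parameters of the 3D representation, x = g(\theta) a rendered view, x_t = \alpha_t x + \sigma_t \epsilon its noised version at timestep t, \hat{\epsilon}_\phi the noise prediction of the 2D diffusion prior, w(t) a timestep weighting, and y the conditioning signal, which here would stand for the identity and expression inputs. Whether ID-to-3D modifies this form is not stated in this summary.

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right], \qquad x = g(\theta)
\]

Intuitively, the 2D prior never has to generate a 3D object itself: its denoising residual on each rendered view is backpropagated through the renderer to update the 3D parameters.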
Abstract
We propose ID-to-3D, a method to generate identity- and text-guided 3D human heads with disentangled expressions, starting from even a single casually captured in-the-wild image of a subject. The foundation of our approach is anchored in compositionality, alongside the use of task-specific 2D diffusion models as priors for optimization. First, we extend a foundational model with a lightweight expression-aware and ID-aware architecture, and create 2D priors for geometry and texture generation by fine-tuning only 0.2% of its available training parameters. Then, we jointly leverage a neural parametric representation for the expressions of each subject and a multi-stage generation of highly detailed geometry and albedo texture. This combination of strong face identity embeddings and our neural representation enables accurate reconstruction of not only facial features but also accessories and hair, and the results can be meshed to provide render-ready assets for gaming and telepresence. Our results achieve an unprecedented level of identity-consistent and high-quality texture and geometry generation, generalizing to a "world" of unseen 3D identities, without relying on large 3D captured datasets of human assets.
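To give a sense of why low-rank adapters keep the trainable share of a model small, the following minimal sketch counts parameters for LoRA-adapted linear layers. It assumes the generic LoRA formulation (a frozen weight W plus a trainable low-rank product B·A); the layer shapes, the rank, and the helper names `lora_param_counts` and `trainable_fraction` are hypothetical and are not taken from the paper, so the printed value is not meant to reproduce the 0.2% figure.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (frozen, trainable) parameter counts for one LoRA-adapted linear layer."""
    frozen = d_in * d_out              # original weight W (d_out x d_in), kept frozen
    trainable = rank * (d_in + d_out)  # low-rank factors A (rank x d_in) and B (d_out x rank)
    return frozen, trainable


def trainable_fraction(layers, rank):
    """Share of trainable parameters over all parameters for a list of (d_in, d_out) layers."""
    frozen_total = 0
    trainable_total = 0
    for d_in, d_out in layers:
        frozen, trainable = lora_param_counts(d_in, d_out, rank)
        frozen_total += frozen
        trainable_total += trainable
    return trainable_total / (frozen_total + trainable_total)


# Hypothetical configuration: four 1024x1024 projections, each with a rank-4 adapter.
layers = [(1024, 1024)] * 4
frac = trainable_fraction(layers, rank=4)
print(f"trainable share: {frac:.4%}")
```

In a full model only a subset of layers typically receives adapters while all remaining weights stay frozen, which pushes the overall trainable share lower than in this toy configuration.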
