ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling

Francesca Babiloni, Alexandros Lattas, Jiankang Deng, Stefanos Zafeiriou

TL;DR

ID-to-3D addresses the challenge of producing high-fidelity, identity-consistent 3D head assets from casual 2D images. It applies Score Distillation Sampling (SDS) with 2D diffusion priors conditioned on ArcFace identity embeddings and CLIP-based expression conditioning, and it decouples geometry from texture in a two-stage neural parametric head framework built on DMTet and a UV-map texture model. A key innovation is fine-tuning only about 0.2% of the parameters with LoRA adapters and integrating identity and expression conditioning through multimodal cross-attention, which enables up to 13 distinct expressions per subject while preserving identity, including for previously unseen subjects. The results show state-of-the-art 3D head quality, relightable textures, and robust identity consistency across viewpoints and expressions without requiring large 3D scan datasets, which makes render-ready 3D assets practical for gaming and telepresence. The work also discusses ethical considerations and limitations around biases and computational demands.

Abstract

We propose ID-to-3D, a method to generate identity- and text-guided 3D human heads with disentangled expressions, starting from even a single casually captured in-the-wild image of a subject. The foundation of our approach is anchored in compositionality, alongside the use of task-specific 2D diffusion models as priors for optimization. First, we extend a foundational model with a lightweight expression-aware and ID-aware architecture, and create 2D priors for geometry and texture generation by fine-tuning only 0.2% of its available training parameters. Then, we jointly leverage a neural parametric representation for the expressions of each subject and a multi-stage generation of highly detailed geometry and albedo texture. This combination of strong face identity embeddings and our neural representation enables accurate reconstruction of not only facial features but also accessories and hair, and the result can be meshed to provide render-ready assets for gaming and telepresence. Our results achieve an unprecedented level of identity-consistent and high-quality texture and geometry generation, generalizing to a "world" of unseen 3D identities, without relying on large 3D captured datasets of human assets.
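
To make the "0.2% of the parameters" claim more concrete, the following is a minimal sketch of how a low-rank (LoRA-style) adapter keeps the trainable parameter count small. It is not the authors' implementation: the function names `lora_trainable_fraction` and `lora_forward`, the layer sizes, and the rank are all hypothetical and chosen only for illustration.

```python
import numpy as np


def lora_trainable_fraction(d_in, d_out, rank):
    """Fraction of parameters that are trainable when one frozen
    d_out x d_in weight matrix gets a rank-`rank` low-rank adapter."""
    frozen = d_in * d_out
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return adapter / (frozen + adapter)


def lora_forward(x, W, A, B, scale=1.0):
    """Frozen linear layer plus low-rank update: x W^T + scale * x A^T B^T."""
    return x @ W.T + scale * (x @ A.T) @ B.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_in, d_out, rank = 1024, 1024, 1  # hypothetical sizes, not from the paper
    W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
    A = rng.standard_normal((rank, d_in)) * 0.01  # trainable
    B = np.zeros((d_out, rank))                   # trainable, zero-initialised
    x = rng.standard_normal((4, d_in))
    # With B = 0 the adapted layer reproduces the frozen layer exactly.
    assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
    print(f"trainable fraction: {lora_trainable_fraction(d_in, d_out, rank):.4%}")
```

For these illustrative sizes the trainable fraction is well below 1%, which shows the order of magnitude such adapters can reach. The actual fraction in ID-to-3D depends on which layers are adapted and on the chosen ranks, neither of which is specified in this summary.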

Paper Structure (21 sections, 6 equations, 15 figures, 1 table)

Figures (15)

  • Figure 1: ID-to-3D leverages identity conditioning and score distillation sampling on large diffusion models, achieving high-quality 3D human asset generation from "in-the-wild" images, without training on large scanned datasets. From left to right: a) renderings, b) input images, c) normals.
  • Figure 2: (Left) Overall pipeline. ID-to-3D generates expressive 3D head avatars via ArcFace $y_{\text{id}}$ and textual $y_{\text{text}}$ conditioning. It uses pretrained geometry-oriented $\phi_{g}$ and albedo-oriented $\phi_{a}$ models as priors. (Training) The training phase uses SDS to optimize 3D geometry $\psi_{g}$, texture $\psi_{a}$, and a set of expression latent codes $\mathbf{k}_{\textbf{exp}}$. It also leverages random lighting $\mathbf{l}$ and random expression conditioning $y_{\text{exp}}$. (Inference) At deployment time, ID-to-3D extracts high-quality identity-aware expressive 3D meshes. (Right) ID-consistent expressive 3D heads generated by our method. ID-to-3D creates 3D assets that support relighting, ID-consistent editing, and physical simulation.
  • Figure 3: Qualitative results for text-to-3D (*) and image-to-3D methods. Methods are evaluated under the same text prompt and rendering conditions. DreamCraft3D is reported as DC3D. Geometry is displayed via normal maps in camera coordinates. Using only a small set of $5$ images as conditioning, ID-to-3D achieves high geometric quality and realistic textures.
  • Figure 4: (Left) Identity Similarity Distribution between "in-the-wild" images and renderings of 3D heads. (Right) Comparative Preference Survey on texture quality (outside) and geometry quality (inside). We report the percentage of preferences.
  • Figure 5: ID-to-3D expression diversity. Renderings and normal maps in camera coordinates are shown for $3$ identities: Will Smith, Anya Taylor-Joy, and Kanye West. Our method achieves fine-grained geometry carving and high-quality texture generation, realistically reproducing various skin tones.
  • ...and 10 more figures
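
The Figure 2 caption says that SDS is used to optimize the geometry $\psi_{g}$ against the prior $\phi_{g}$. As a rough guide, the generic SDS gradient from the original DreamFusion formulation can be written with that notation as shown here. This is a sketch under assumptions and not one of the paper's six equations, which are not reproduced in this summary: the renderer $x$, the weighting $w(t)$, the noise schedule $(\alpha_t, \sigma_t)$, and the exact set of conditioning arguments are assumed.

$$
\nabla_{\psi_{g}} \mathcal{L}_{\text{SDS}}
= \mathbb{E}_{t,\,\epsilon}\!\left[\, w(t)\,\bigl(\hat{\epsilon}_{\phi_{g}}(x_t;\; y_{\text{id}},\, y_{\text{text}},\, y_{\text{exp}},\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \psi_{g}} \right],
\qquad
x_t = \alpha_t\, x + \sigma_t\, \epsilon,\quad \epsilon \sim \mathcal{N}(0, I)
$$

Here $x$ denotes a rendering of the current geometry (for example a normal map), and $\hat{\epsilon}_{\phi_{g}}$ is the noise predicted by the frozen 2D prior. An analogous expression with $\psi_{a}$ and $\phi_{a}$ would apply to the texture stage, again as an assumption based only on the caption.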