HeadEvolver: Text to Head Avatars via Expressive and Attribute-Preserving Mesh Deformation

Duotun Wang, Hengyu Meng, Zeyu Cai, Zhijing Shao, Qianxi Liu, Lin Wang, Mingming Fan, Xiaohang Zhan, Zeyu Wang

TL;DR

HeadEvolver tackles the editing bottleneck in text-driven head-avatar generation by replacing implicit representations with explicit deformation of a template mesh, driven by per-face Jacobians $J_f$ augmented with a learnable vector field $H_f$. Deformations are computed via a Poisson formulation and guided by a score distillation loss from 2D diffusion priors, with landmark and evolving contour regularizers to preserve 3D attributes such as landmarks, rig, and UVs. The method produces expressive, texture-rich head avatars that remain editable in standard graphics tools and compatible with downstream animation, without requiring training data. Experiments demonstrate diverse, high-quality meshes with preserved topology and more faithful identity features, validated by quantitative CLIP-based metrics and user studies.
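
For readers unfamiliar with Jacobian-driven deformation, the Poisson step mentioned above is usually written as follows in prior work on Jacobian fields. This is a sketch of that standard formulation, not an equation copied from the paper: the symbols $\Phi$ (deformed vertex positions), $\nabla_f$ (gradient operator restricted to face $f$), and $|f|$ (area of face $f$) are assumed notation, and the way $H_f$ modifies the target $J_f$ is not spelled out in this summary.

$$
\Phi^{*} = \arg\min_{\Phi} \sum_{f} |f| \, \lVert \nabla_f \Phi - J_f \rVert_F^{2}
$$

Setting the gradient of this objective to zero gives the linear system

$$
\left( \nabla^{\top} \mathcal{A} \, \nabla \right) \Phi^{*} = \nabla^{\top} \mathcal{A} \, J ,
$$

where $\nabla$ stacks the per-face operators $\nabla_f$, $J$ stacks the targets $J_f$ in the matching layout, and $\mathcal{A}$ is the diagonal matrix of face areas. The solution is unique up to a global translation, and the system matrix depends only on the template mesh, so it can be factorized once and reused at every optimization step.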

Abstract

Current text-to-avatar methods often rely on implicit representations (e.g., NeRF, SDF, and DMTet), leading to 3D content that artists cannot easily edit and animate in graphics software. This paper introduces a novel framework for generating stylized head avatars from text guidance, which leverages locally learnable mesh deformation and 2D diffusion priors to achieve high-quality digital assets for attribute-preserving manipulation. Given a template mesh, our method represents mesh deformation with per-face Jacobians and adaptively modulates local deformation using a learnable vector field. This vector field enables anisotropic scaling while preserving the rotation of vertices, which can better express identity and geometric details. We employ landmark- and contour-based regularization terms to balance the expressiveness and plausibility of generated avatars from multiple views without relying on any specific shape prior. Our framework can generate realistic shapes and textures that can be further edited via text, while supporting seamless editing using the preserved attributes from the template mesh, such as 3DMM parameters, blendshapes, and UV coordinates. Extensive experiments demonstrate that our framework can generate diverse and expressive head avatars with high-quality meshes that artists can easily manipulate in graphics software, facilitating downstream applications such as efficient asset creation and animation with preserved attributes.
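
To make the per-face Jacobian representation concrete, here is a minimal 2D sketch. The paper works on 3D head meshes; the 2D setting, the function names `gradient_operator` and `poisson_solve`, and the two-triangle square are all illustrative assumptions. The sketch recovers vertex positions from target per-face Jacobians by area-weighted least squares and shows that an anisotropic scaling Jacobian stretches the mesh along one axis while the connectivity stays unchanged. It does not model the learnable vector field, the regularizers, or the diffusion guidance.

```python
import numpy as np


def gradient_operator(verts, faces):
    """Build G so that (G @ P)[2f:2f+2] is the transposed Jacobian of face f.

    verts: (V, 2) rest positions, faces: (F, 3) vertex indices.
    Returns G with shape (2F, V) and the rest areas with shape (F,).
    """
    n_f, n_v = len(faces), len(verts)
    G = np.zeros((2 * n_f, n_v))
    areas = np.zeros(n_f)
    # D maps the three vertices of a triangle to its two edge vectors.
    D = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
    for f, tri in enumerate(faces):
        E = verts[tri].T @ D  # (2, 2), columns are the rest edges
        G[2 * f:2 * f + 2, tri] = np.linalg.inv(E).T @ D.T
        areas[f] = 0.5 * abs(np.linalg.det(E))
    return G, areas


def poisson_solve(verts, faces, jacobians):
    """Find positions whose per-face Jacobians best match the targets.

    Solves the area-weighted least-squares problem. Positions are only
    determined up to translation; lstsq returns the minimum-norm solution,
    which is centered at the origin.
    """
    G, areas = gradient_operator(verts, faces)
    w = np.sqrt(np.repeat(areas, 2))[:, None]  # one weight per row of G
    rhs = np.concatenate([J.T for J in jacobians], axis=0)  # (2F, 2)
    P, *_ = np.linalg.lstsq(w * G, w * rhs, rcond=None)
    return P


# Hypothetical template: a unit square split into two triangles.
verts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 2, 3]])

# Identity Jacobians reproduce the template (up to translation).
same = poisson_solve(verts, faces, [np.eye(2)] * len(faces))

# Anisotropic scaling: stretch by 2 along x, leave y untouched.
stretch = np.diag([2.0, 1.0])
stretched = poisson_solve(verts, faces, [stretch] * len(faces))

print("extent of template: ", same.max(axis=0) - same.min(axis=0))
print("extent after stretch:", stretched.max(axis=0) - stretched.min(axis=0))
```

Because the faces array is never modified, anything indexed by vertex or face (UV coordinates, landmark indices, blendshape offsets) still lines up with the deformed mesh, which is the property the abstract relies on for attribute preservation.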

Paper Structure

This paper contains 15 sections, 7 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: Framework overview. We deform a template mesh by optimizing per-triangle vector fields guided by a text prompt. Rendered normal and RGB images coupled with MediaPipe landmarks are fed into a diffusion model to compute respective losses. Our regularization of Jacobians controls the fidelity and semantics of facial features that conform to text guidance.
  • Figure 2: Visualization of vector field $H_f$ through $W_f$. Orange colors indicate strong deformations and $W_f=I$ refers to no deformation enhancement.
  • Figure 3: Qualitative appearance comparisons. Our method can produce diverse textured avatars.
  • Figure 4: Qualitative geometry comparisons. Our method effectively preserves the topology and semantics of the input template mesh, resulting in high-quality mesh models for smooth manipulations.
  • Figure 5: Ablation study on the vector fields. The per-triangle $H$ can enhance the identity and character features of the avatar.
  • ...and 6 more figures