HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

Zhenglin Zhou, Fan Ma, Hehe Fan, Zongxin Yang, Yi Yang

TL;DR

HeadStudio addresses the challenge of text-to-animatable 3D head avatars by coupling 3D Gaussian splatting with an animatable head prior (FLAME). It introduces Animatable Head Gaussian and a Text to Avatar Optimization pipeline with super-dense initialization, animation-aware diffusion distillation with MediaPipe guidance, and adaptive geometry regularization. The method delivers high-fidelity avatars with real-time rendering (≥40 fps at 1024 resolution) and supports driving via speech and video. Empirical results demonstrate superior geometry, texture fidelity, and animation coherence compared with state-of-the-art static and dynamic head methods. The work enables efficient text-to-avatar creation for real-time applications.

Abstract

Creating digital avatars from textual prompts has long been a desirable yet challenging task. Despite the promising results achieved with 2D diffusion priors, current methods struggle to create high-quality and consistent animated avatars efficiently. Previous animatable head models like FLAME have difficulty accurately representing detailed texture and geometry. Additionally, high-quality static 3D representations are difficult to drive semantically with dynamic priors. In this paper, we introduce HeadStudio, a novel framework that utilizes 3D Gaussian splatting to generate realistic and animatable avatars from text prompts. First, we associate 3D Gaussians with an animatable head prior model, facilitating semantic animation on high-quality 3D representations. To ensure consistent animation, we further enhance the optimization in terms of initialization, distillation, and regularization to jointly learn the shape, texture, and animation. Extensive experiments demonstrate the efficacy of HeadStudio in generating animatable avatars with appealing appearances from textual prompts. The avatars can render high-quality novel views in real time ($\geq 40$ fps) at a resolution of 1024. Moreover, these avatars can be smoothly driven by real-world speech and video. We hope that HeadStudio can enhance digital avatar creation and gain popularity in the community. Code is at: https://github.com/ZhenglinZhou/HeadStudio.

Paper Structure (19 sections, 9 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: Text-based animatable avatars generation by HeadStudio. With only one end-to-end training stage of 2 hours on 1 NVIDIA A6000 GPU, HeadStudio is able to generate animatable, high-fidelity and real-time rendering ($\geq 40$ fps) head avatars using text inputs.
  • Figure 2: Framework of HeadStudio, which integrates animatable head prior model into 3D Gaussian splatting and score distillation sampling. 1) Animatable Head Gaussian: each 3D point is rigged to a mesh, and then rotated, scaled, and translated by the mesh deformation. 2) Text to Avatar Optimization: enhance the optimization from initialization, distillation and regularization, including: super-dense Gaussian initialization, animation-based text-to-3D distillation, and adaptive geometry regularization.
  • Figure 3: Visualization of Mesh Area.
  • Figure 4: Comparison with the text-to-static avatar generation methods. Our approach excels at producing high-fidelity head avatars, yielding superior results.
  • Figure 5: Comparison with the text-to-dynamic avatar generation method TADA (Liao et al., 2023) in terms of semantic alignment and rendering speed. The yellow circles indicate semantic misalignment in the mouths, resulting in misplaced mouth texture. The rendering speed evaluation on the same device is reported in the blue box. The FLAME mesh of the avatar is visualized on the bottom right. Our method provides effective semantic alignment, smooth expression deformation, and real-time rendering.
  • ...and 13 more figures
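
The Figure 2 caption says each 3D point is rigged to a mesh and then rotated, scaled, and translated by the mesh deformation. A minimal sketch of one way to do this is given below, assuming each Gaussian is attached to a single triangle and stored in that triangle's local frame. The function names (`triangle_frame`, `deform_gaussian`), the choice of the triangle centroid as the translation, and the use of the square root of the triangle area as the scale factor are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def triangle_frame(v0, v1, v2):
    """Return (R, T, k) for a triangle: rotation, translation, scale.

    Assumed convention (hypothetical, for illustration only):
    R has the local x, y, z axes as columns, T is the centroid,
    and k is the square root of the triangle area.
    """
    e1 = v1 - v0
    n = np.cross(e1, v2 - v0)          # normal; its length is 2 * area
    x = e1 / np.linalg.norm(e1)        # local x axis along the first edge
    z = n / np.linalg.norm(n)          # local z axis along the normal
    y = np.cross(z, x)                 # completes a right-handed frame
    R = np.stack([x, y, z], axis=1)
    T = (v0 + v1 + v2) / 3.0
    k = np.sqrt(0.5 * np.linalg.norm(n))
    return R, T, k

def deform_gaussian(mu_local, rot_local, scale_local, tri):
    """Map a Gaussian from triangle-local space to world space."""
    R, T, k = triangle_frame(*tri)
    mu = k * (R @ mu_local) + T        # rotate, scale, then translate the mean
    rot = R @ rot_local                # rotate the Gaussian's orientation
    scale = k * scale_local            # scale the Gaussian's extent
    return mu, rot, scale

# Usage: one Gaussian sitting one local unit above a triangle.
tri = (np.array([0.0, 0.0, 0.0]),
       np.array([2.0, 0.0, 0.0]),
       np.array([0.0, 2.0, 0.0]))
mu, rot, scale = deform_gaussian(np.array([0.0, 0.0, 1.0]),
                                 np.eye(3),
                                 np.array([0.1, 0.1, 0.1]),
                                 tri)
```

Because the Gaussian parameters are expressed relative to the triangle, moving the mesh vertices (for example through new FLAME expression or pose parameters) moves the Gaussian with them, and no per-frame optimization of the Gaussians is needed in this sketch.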