PESTalk: Speech-Driven 3D Facial Animation with Personalized Emotional Styles

Tianshun Han, Benjia Zhou, Ajian Liu, Yanyan Liang, Du Zhang, Zhen Lei, Jun Wan

TL;DR

Problem: existing speech-driven 3D facial animation often focuses on lip-sync and lacks realistic emotional expression and personalization. Approach: PESTalk introduces a Dual-Stream Emotion Extractor (time- and frequency-domain features) and an Emotional Style Modeling Module to produce personalized emotional styles from speech, with a partitioned style-guided decoder. Dataset: to tackle data scarcity, the authors construct 3D-EmoStyle using pseudo-blendshape labels and deformation transfer to FLAME meshes. Findings: PESTalk achieves state-of-the-art performance on multiple metrics for both lip synchronization and emotion reproduction, with strong qualitative results and positive user studies. Impact: because the model outputs blendshape coefficients, its results integrate directly into industry pipelines and engines such as Unreal/MetaHuman, advancing realistic digital humans.

Abstract

PESTalk is a novel method for generating 3D facial animations with personalized emotional styles directly from speech. It overcomes key limitations of existing approaches by introducing a Dual-Stream Emotion Extractor (DSEE) that captures both time- and frequency-domain audio features for fine-grained emotion analysis, and an Emotional Style Modeling Module (ESMM) that models individual expression patterns based on voiceprint characteristics. To address data scarcity, the method leverages a newly constructed 3D-EmoStyle dataset. Evaluations demonstrate that PESTalk outperforms state-of-the-art methods in producing realistic and personalized facial animations.
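
To make the idea of a time-domain stream and a frequency-domain stream concrete, here is a minimal sketch in plain Python. It is not the paper's DSEE, which is a learned module whose architecture is not described here. The hand-crafted features (RMS energy, zero-crossing rate, spectral centroid), the function names, and the frame sizes are all illustrative assumptions.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def time_features(frame):
    """Hypothetical time-domain stream: RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return [rms, crossings / (len(frame) - 1)]


def freq_features(frame, sample_rate):
    """Hypothetical frequency-domain stream: spectral centroid in Hz via a naive DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        mags.append(abs(coeff))
    total = sum(mags)
    if total == 0:
        return [0.0]
    # Weight each bin's centre frequency by its magnitude.
    return [sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total]


def dual_stream_features(signal, sample_rate, frame_len=64, hop=32):
    """Concatenate both streams per frame: [rms, zero_crossing_rate, centroid_hz]."""
    return [time_features(f) + freq_features(f, sample_rate)
            for f in frame_signal(signal, frame_len, hop)]


# Toy input: a 200 Hz sine sampled at 1600 Hz (assumed values, for illustration only).
sample_rate = 1600
tone = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(256)]
features = dual_stream_features(tone, sample_rate)
```

Each row of `features` holds one frame's time-domain values followed by its frequency-domain value. A learned extractor would replace these fixed statistics with trainable encoders, but the two-branch layout is the same.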

Paper Structure

This paper contains 16 sections, 12 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: (a) When expressing the same emotion with identical sentences, people show different facial expression patterns. These differences are shaped by habitual behaviors and cultural backgrounds and are known as personalized emotional styles. (b) Prior work (e.g., EMOTE; Daněček et al., 2023) treats emotion as one-hot embeddings, producing averaged expressions, while PESTalk dynamically selects optimal expressions by analyzing emotions and voiceprints, generating realistic 3D animations with personalized styles.
  • Figure 2: Overview of PESTalk. Given a speech input $A_{1:T}$, PESTalk extracts emotion features $E_{1:T}$ and voiceprint information $V_{1:T}$. It then uses these two types of information to query the emotional style library and match the closest personalized emotional style $S_{1:T}$. Next, PESTalk extracts content features $C_{1:T}$ from the speech and integrates these features. Two distinct sets of decoders, $\boldsymbol{\Phi}_{D}^{up}$ and $\boldsymbol{\Phi}_{D}^{low}$, then combine the integrated features to generate facial blendshape coefficients for the upper and lower face, respectively. These coefficients can also be used to animate a FLAME model.
  • Figure 3: Pairwise Disentanglement Mechanism. The content extractor $\boldsymbol{\Phi}_{C}$ and emotion extractor $\boldsymbol{\Phi}_{E}$ project paired audios (same semantics, different emotions) into content and emotion latent spaces, respectively. The mechanism pulls close semantically identical features while pushing apart emotionally distinct ones.
  • Figure 4: Sample reference meshes illustrating the mapping from facial blendshape coefficients to FLAME meshes.
  • Figure 5: Visual comparison of facial movements generated by different methods on the 3D-EmoStyle test set. The samples cover distinct emotional expressions. Compared with other approaches, our method produces more emotionally expressive and realistic facial animations.
  • ...and 2 more figures
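
The style-library query described in the Figure 2 caption can be sketched as a nearest-neighbor lookup. This is only an illustration under stated assumptions: the caption says the closest personalized emotional style is matched, but it does not specify the similarity measure or the library layout. Cosine similarity, the concatenated emotion-plus-voiceprint key, the function names, and the toy library entries below are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_style_library(emotion, voiceprint, library):
    """Return (name, style) of the entry whose key is closest to the query.

    `library` maps a name to (key, style), where key is an assumed
    concatenation of an emotion vector and a voiceprint vector.
    """
    if not library:
        raise ValueError("style library is empty")
    query = list(emotion) + list(voiceprint)
    best_name, best_score = None, -math.inf
    for name, (key, _style) in library.items():
        score = cosine_similarity(query, key)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, library[best_name][1]


# Toy library: keys are [emotion (2 dims), voiceprint (2 dims)]; styles are made-up vectors.
toy_library = {
    "happy_speaker_a": ([1.0, 0.0, 0.9, 0.1], [0.8, 0.2]),
    "happy_speaker_b": ([1.0, 0.0, 0.1, 0.9], [0.3, 0.7]),
    "sad_speaker_a": ([0.0, 1.0, 0.9, 0.1], [0.1, 0.1]),
}

name, style = query_style_library([0.9, 0.1], [0.2, 0.8], toy_library)
```

In this toy setup the emotion part of the query narrows the match to the two "happy" entries, and the voiceprint part picks between them, which mirrors the caption's use of both signals to select one style.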