PESTalk: Speech-Driven 3D Facial Animation with Personalized Emotional Styles
Tianshun Han, Benjia Zhou, Ajian Liu, Yanyan Liang, Du Zhang, Zhen Lei, Jun Wan
TL;DR
Problem: existing speech-driven 3D facial animation methods often focus on lip-sync and lack realistic, personalized emotional expression. Approach: PESTalk introduces a Dual-Stream Emotion Extractor (time- and frequency-domain audio features) and an Emotional Style Modeling Module that derives personalized emotional styles from speech, together with a partitioned style-guided decoder. Dataset: to tackle data scarcity, the authors construct 3D-EmoStyle using pseudo-blendshape labels and deformation transfer to FLAME meshes. Findings: PESTalk achieves state-of-the-art performance on multiple metrics for both lip synchronization and emotion reproduction, with strong qualitative results and positive user studies. Impact: because the output is expressed as blendshape coefficients, it can be integrated into industry pipelines and engines such as Unreal Engine and MetaHuman, advancing realistic digital humans.
Abstract
PESTalk is a novel method for generating 3D facial animations with personalized emotional styles directly from speech. It addresses key limitations of existing approaches by introducing a Dual-Stream Emotion Extractor (DSEE), which captures both time- and frequency-domain audio features for fine-grained emotion analysis, and an Emotional Style Modeling Module (ESMM), which models individual expression patterns based on voiceprint characteristics. To address data scarcity, the method is trained on a newly constructed 3D-EmoStyle dataset. Evaluations show that PESTalk outperforms state-of-the-art methods in producing realistic and personalized facial animations.
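
To make the dual-stream idea concrete, the following is a minimal sketch of what time-domain and frequency-domain views of the same audio frame look like. It is not the authors' DSEE: the abstract does not specify the extractor's internals, so hand-crafted features (RMS energy, zero-crossing rate, DFT magnitudes) stand in for whatever learned encoders the paper uses, fusion by concatenation is an assumption, and every function name here is hypothetical.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a waveform into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def time_features(frame):
    """Time-domain stream (illustrative): RMS energy and zero-crossing rate."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [rms, crossings / (len(frame) - 1)]


def freq_features(frame):
    """Frequency-domain stream (illustrative): normalized DFT magnitudes
    for bins 0..N/2, computed with a naive O(N^2) transform."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) / n
            for k in range(n // 2 + 1)]


def dual_stream_features(samples, frame_len=64, hop=32):
    """Concatenate both streams per frame (assumed fusion strategy)."""
    return [time_features(f) + freq_features(f)
            for f in frame_signal(samples, frame_len, hop)]


# Toy input: a pure tone with 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
features = dual_stream_features(tone)
```

Each entry of `features` holds two time-domain values followed by the magnitude spectrum of the same frame. For the toy tone the spectrum peaks at bin 8, while the RMS value carries loudness information that a single spectral bin does not, which is the intuition behind analyzing both domains.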
