EmoFace: Audio-driven Emotional 3D Face Animation
Chang Liu, Qunfen Lin, Zijiao Zeng, Ye Pan
TL;DR
EmoFace generates emotionally expressive 3D facial animation from audio input, addressing a gap left by prior work that focused primarily on lip synchronization without robust emotion modeling. It combines an independent wav2vec 2.0-based audio encoder and an emotion encoder with a transformer-based Audio2Rig mapper to output a 174-dimensional rig sequence for MetaHuman characters, complemented by dedicated blink and gaze controllers. The authors introduce a Chinese emotional audio-visual dataset with ground-truth rig parameters and report quantitative, qualitative, and user studies showing superior naturalness and emotion accuracy compared with state-of-the-art baselines. The work enables efficient production of emotionally rich NPCs and avatars for VR and games.
Abstract
Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions, or has proved unsuitable for driving MetaHuman models. To address these shortcomings, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions for multiple emotions, as well as random yet natural blinks and eye movements, while maintaining accurate lip synchronization. We propose independent speech encoders and emotion encoders to learn the relationship between audio, emotion, and the corresponding facial controller rigs, and finally map them to a sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frame. Our methodology can be applied to producing dialogue animations for non-playable characters (NPCs) in video games and to driving avatars in virtual reality environments. Quantitative and qualitative experiments, together with a user study comparing against existing methods, show that our approach produces superior results in driving 3D facial models. The code and sample data are available at https://github.com/SJTU-Lucy/EmoFace.
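As a rough illustration of what a blink post-processing step could look like, the sketch below overwrites the eyelid controllers of a predicted rig sequence with randomly timed blinks. It is not the authors' implementation: the abstract does not specify how blinks are sampled, so the exponential inter-blink gap, the triangular blink curve, the frame rate, and the function name `add_random_blinks` and its channel indices are all assumptions made for this example.

```python
import random


def add_random_blinks(rig_frames, blink_channels, fps=60, mean_gap_s=4.0,
                      blink_len_s=0.2, seed=None):
    """Return a copy of rig_frames with blinks written into blink_channels.

    rig_frames: list of per-frame controller lists (e.g. 174 floats each).
    blink_channels: indices of the eyelid-closing controllers (hypothetical;
        the real indices depend on the rig layout).
    """
    rng = random.Random(seed)
    out = [list(frame) for frame in rig_frames]  # do not mutate the input
    blink_len = max(2, round(blink_len_s * fps))

    # Assumption: gaps between blinks follow an exponential distribution.
    t = round(rng.expovariate(1.0 / mean_gap_s) * fps)
    while t + blink_len <= len(out):
        for k in range(blink_len):
            # Triangular profile: rises from 0, peaks mid-blink, returns to 0.
            phase = k / (blink_len - 1)
            value = 1.0 - abs(2.0 * phase - 1.0)
            for ch in blink_channels:
                # Keep whichever eyelid closure is larger: predicted or blink.
                out[t + k][ch] = max(out[t + k][ch], value)
        t += blink_len + round(rng.expovariate(1.0 / mean_gap_s) * fps)
    return out


# Usage: 10 seconds of neutral 174-dimensional rig frames at 60 fps.
frames = [[0.0] * 174 for _ in range(600)]
blinked = add_random_blinks(frames, blink_channels=[0, 1], seed=0)
```

Taking the maximum of the predicted value and the blink curve means a blink never reopens an eye that the model has already closed, which is one simple way to keep the post-processing from fighting the network output.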
