EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face Animation

Yihong Lin, Liang Peng, Zhaoxin Fan, Xianjia Wu, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei

TL;DR

EmoFace tackles the challenge of integrating emotional expression into 3D talking-face animation by disentangling emotion and content from speech into separate branches and fusing them through Mesh Attention, which is built on SpiralConv3D. The model is trained with a self-growing scheme and intermediate supervision to reduce error accumulation and improve robustness. It is evaluated on VOCASET and the newly created 3D-RAVDESS dataset, where it achieves state-of-the-art lip synchronization and emotionally expressive facial motion. The work also introduces 3D-RAVDESS, a high-quality emotional 3D facial dataset reconstructed from 2D data, which enables evaluation of emotional expressiveness. Across quantitative metrics (LVE, EVE), qualitative visualizations, and user studies (MOS), EmoFace produces more realistic and emotion-accurate facial dynamics than the compared methods, and ablations confirm the contributions of Mesh Attention, SpiralConv3D, and the self-growing training strategy.

Abstract

The creation of increasingly vivid 3D talking faces has become a hot topic in recent years. Currently, most speech-driven works focus on lip synchronisation but neglect to effectively capture the correlations between emotions and facial motions. To address this problem, we propose a two-stream network called EmoFace, which consists of an emotion branch and a content branch. EmoFace employs a novel Mesh Attention mechanism to analyse and fuse the emotion features and content features. In particular, a newly designed spatio-temporal graph-based convolution, SpiralConv3D, is used in Mesh Attention to learn potential temporal and spatial feature dependencies between mesh vertices. In addition, to the best of our knowledge, this is the first work to introduce a self-growing training scheme with intermediate supervision that dynamically adjusts the ratio of groundtruth adopted in the 3D face animation task. Comprehensive quantitative and qualitative evaluations on our high-quality 3D emotional facial animation dataset, 3D-RAVDESS ($4.8863\times 10^{-5}$ mm for LVE and $0.9509\times 10^{-5}$ mm for EVE), together with the public dataset VOCASET ($2.8669\times 10^{-5}$ mm for LVE and $0.4664\times 10^{-5}$ mm for EVE), demonstrate that our approach achieves state-of-the-art performance.
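
The abstract reports LVE without giving its formula. As a rough illustration only, the sketch below follows one convention common in speech-driven animation work: take the largest Euclidean error over the lip vertices in each frame, then average over frames. The function name, the `lip_idx` vertex subset, and the choice of unsquared distance are assumptions, not the paper's stated definition; some implementations use squared distances instead.

```python
import math

def lip_vertex_error(pred, ref, lip_idx):
    """Mean over frames of the maximum per-vertex Euclidean error on lip_idx.

    pred, ref: sequences of frames; each frame is a list of (x, y, z) vertices.
    lip_idx:   indices of the vertices treated as the lip region (assumed).
    """
    per_frame_max = []
    for pred_frame, ref_frame in zip(pred, ref):
        errors = [math.dist(pred_frame[i], ref_frame[i]) for i in lip_idx]
        per_frame_max.append(max(errors))
    return sum(per_frame_max) / len(per_frame_max)

# Toy data: two frames, two vertices each, reference at the origin.
ref = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
       [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],   # worst lip error in frame 0: 5.0
        [(0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]]   # worst lip error in frame 1: 1.0
lve = lip_vertex_error(pred, ref, lip_idx=[0, 1])
```

The same function could be applied to a different vertex subset to score another facial region.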

Paper Structure (33 sections, 7 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: The overall framework of EmoFace. The emotion and content branches disentangle the information in speech, while Mesh Attention fuses these two branches to obtain the final result. The entire framework is end-to-end, thus allowing for efficient training and inference. $\hat{M}_i$ denotes the $i$-th predicted motion and $M_i$ denotes the $i$-th reference motion. $k$ and $\delta$ represent the spatial and the temporal neighbourhood of SpiralConv3D respectively.
  • Figure 2: Self-growing scheme. In the first stage, the model inputs all groundtruth frames at once and directly predicts the next frame corresponding to each input frame. In the second stage, the input is changed to a fusion of the groundtruth frames and the predicted frames from the previous stage, and the same prediction process is applied to obtain the final prediction results.
  • Figure 3: Qualitative comparison of the facial movements of the different methods on 3D-RAVDESS (left) and VOCASET (right). On 3D-RAVDESS, we generate facial animations of saying the sentence “Kids are talking by the door.” with a surprised emotion. On VOCA-Test, facial animations of saying the sentence “How many crystal modifications of uranium hydride are extent?” without emotion are generated. Significant differences in the lip region are denoted by red boxes. EmoFace generates more realistic facial movements that match the speech, whether or not the speech is emotional.
  • Figure 4: Visualization of the importance of emotional information for facial regions. The eyes, mouth and jaw are strongly correlated with emotions.
  • Figure 5: Supervision training strategy of the emotion-content disentanglement module. Various speech inputs, conveying the same content but different emotions, are processed to generate cross-reconstructed mesh vertex offsets representing distinct combinations of facial expressions. Supervision is applied to both branches and to the final output.
  • ...and 2 more figures
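
The two-stage self-growing scheme described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_next` stands in for the network, frames are flat lists of floats, the fusion rule (keep each groundtruth frame with probability `gt_ratio`, otherwise substitute the stage-one prediction) is assumed, and summing the losses of both stages is only one possible reading of "intermediate supervision".

```python
import random

def fuse_frames(gt_frames, pred_frames, gt_ratio, rng):
    """Keep each groundtruth frame with probability gt_ratio, else use the prediction."""
    return [gt if rng.random() < gt_ratio else pred
            for gt, pred in zip(gt_frames, pred_frames)]

def self_growing_pass(predict_next, gt_frames, gt_ratio, rng):
    """Run both stages on one sequence; return (stage1, stage2, targets)."""
    inputs, targets = gt_frames[:-1], gt_frames[1:]
    # Stage 1: every input is groundtruth; output i predicts frame i + 1.
    stage1 = predict_next(inputs)
    # Stage 2: frame 0 has no prediction, so it stays groundtruth. Each later
    # input frame i is either gt_frames[i] or its prediction stage1[i - 1].
    mixed = [inputs[0]] + fuse_frames(inputs[1:], stage1[:-1], gt_ratio, rng)
    stage2 = predict_next(mixed)
    return stage1, stage2, targets

def mse(pred_frames, ref_frames):
    """Mean squared error over all frames and coordinates."""
    sq = [(p - r) ** 2
          for pf, rf in zip(pred_frames, ref_frames)
          for p, r in zip(pf, rf)]
    return sum(sq) / len(sq)

def toy_predict(frames):
    """Hypothetical stand-in model: overshoots the true +1.0 step by 0.5."""
    return [[v + 1.5 for v in frame] for frame in frames]

gt = [[0.0], [1.0], [2.0], [3.0]]
stage1, stage2, targets = self_growing_pass(
    toy_predict, gt, gt_ratio=0.0, rng=random.Random(0))
# Assumed form of intermediate supervision: both stage outputs are penalised.
loss = mse(stage1, targets) + mse(stage2, targets)
```

With `gt_ratio=1.0` the second stage sees only groundtruth and reproduces the first stage. Lowering the ratio feeds the model more of its own predictions, which exposes the error accumulation that the scheme is meant to reduce.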