CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation

Xiangyu Liang, Wenlin Zhuang, Tianyong Wang, Guangxing Geng, Guangyue Geng, Haifeng Xia, Siyu Xia

TL;DR

CSTalk targets natural, emotion-aware speech-driven 3D facial animation by explicitly modeling correlations among facial regions with transformer encoders and using those correlations to supervise generation. The framework drives a MetaHuman-based set of 185 control rigs with a two-stage pipeline: a correlation module first learns emotion-specific interactions between facial regions, and an autoencoder-based generator then takes Wav2Vec 2.0 audio features as input and produces emotion-conditioned control-rig sequences with a TCN decoder. Results on a newly collected dataset (five emotions, 100 samples per emotion) show improved lip-sync accuracy and expression realism over state-of-the-art methods such as FaceFormer and EmoTalk. Because the output is a set of detailed, reusable animation parameters for MetaHuman avatars, the approach has practical potential for industrial pipelines.

Abstract

Speech-driven 3D facial animation technology has been developed for years, but its practical application still falls short of expectations. The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions. Although lip alignment has been studied extensively, existing methods struggle to synthesize natural and realistic expressions, so the resulting facial animations look mechanical and stiff. Even where prior work extracts emotional features from speech, the randomness of facial movements limits how effectively emotions are expressed. To address this issue, this paper proposes CSTalk (Correlation Supervised), a method that models the correlations among different regions of facial movement and uses them to supervise the training of the generative model, so that it generates realistic expressions that conform to human facial motion patterns. To generate more intricate animations, we employ a rich set of control parameters based on the MetaHuman character model and capture a dataset covering five different emotions. We train a generative network with an autoencoder structure and feed it an emotion embedding vector so that users can control the generated expression. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.

Paper Structure

This paper contains 11 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The pipeline of CSTalk. Our workflow consists of two stages. In the first stage, we train the correlation module on fixed-length sequences of expressive speech animation data $S \in \mathbb{R}^{R \times T}$ (where $R$ is the number of control rigs), together with the corresponding emotion labels. The module uses multiple transformer encoder layers to compute attention weights, whose average is then fed into a linear layer to predict the emotion label. In the second stage, an autoencoder takes audio data as input and generates a control rig sequence. In the decoder, the input of each TCN (Bai et al., 2018) layer is fused with the corresponding emotion embedding. The output is then passed through the pre-trained correlation module to predict the associated emotion.
  • Figure 2: Visualization of score matrices. (a) Heatmaps of the attention weights for different emotions, with four pieces of animation data shown per emotion. (b) The same weights after dimensionality reduction with t-SNE.
  • Figure 3: Rendered results. Selected frames of predicted animations for the five emotions.
  • Figure 4: Qualitative comparison of the rendered animation frames. We compare our results with the SOTA methods for the "happy" emotion, using the same avatar.
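
The first stage described in the Figure 1 caption can be sketched roughly as follows. This is a minimal, standard-library illustration and not the authors' implementation. It assumes that each control rig's T-frame curve is treated as one token, that attention is single-head scaled dot-product, and that every layer reads the same input (a real transformer encoder would pass each layer's output to the next). All function names, the toy sizes, and the random weights are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rig_attention(seq, w_q, w_k):
    """Scaled dot-product attention weights between control rigs.

    seq is R x T; each rig's T-frame curve is one token (assumption).
    Returns an R x R matrix whose rows each sum to 1.
    """
    q = matmul(seq, w_q)                    # R x d queries
    k = matmul(seq, w_k)                    # R x d keys
    k_t = [list(col) for col in zip(*k)]    # d x R, transpose of the keys
    scale = math.sqrt(len(w_q[0]))
    return [softmax([s / scale for s in row]) for row in matmul(q, k_t)]


def predict_emotion(seq, layers, w_out):
    """Average the per-layer attention maps, flatten, apply a linear layer."""
    # Simplification: every layer attends over the same input sequence.
    maps = [rig_attention(seq, w_q, w_k) for w_q, w_k in layers]
    n = len(seq)
    avg = [[sum(m[i][j] for m in maps) / len(maps) for j in range(n)]
           for i in range(n)]
    flat = [v for row in avg for v in row]
    logits = [sum(w * v for w, v in zip(w_row, flat)) for w_row in w_out]
    return avg, logits


def random_matrix(rng, rows, cols):
    """Random stand-in for trained weights or captured animation data."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
R, T, D, N_LAYERS, N_EMOTIONS = 6, 8, 4, 2, 5   # toy sizes, not the paper's
S = random_matrix(rng, R, T)                    # stand-in for a rig sequence
layers = [(random_matrix(rng, T, D), random_matrix(rng, T, D))
          for _ in range(N_LAYERS)]
w_out = random_matrix(rng, N_EMOTIONS, R * R)

score_matrix, logits = predict_emotion(S, layers, w_out)
predicted = max(range(N_EMOTIONS), key=lambda i: logits[i])
```

Under these assumptions, `score_matrix` plays the role of the averaged attention weights visualized in Figure 2, and `logits` is the emotion prediction. In the second stage the same pre-trained module would be applied to the generator's output so that its predicted emotion can serve as a supervision signal.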