EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Guanwen Feng, Haoran Cheng, Yunan Li, Zhiyuan Ma, Chaoneng Li, Zhihao Qian, Qiguang Miao, Chi-Man Pun

TL;DR

EmoSpeaker tackles one-shot fine-grained emotion-controlled talking-face generation by decoupling audio content from emotion via a Visual Attribute-Guided Audio Decoupler and injecting emotion through a Fine-grained Emotion Coefficient Prediction Module. The system leverages a 3D Morphable Model backbone and a MappingNet-based Emotion Face Renderer to output photorealistic videos with controllable emotional categories and intensities. Key innovations include AU-guided contrastive learning to remove emotional content from audio, and a fine-grained emotion intensity matrix that enables unseen emotional intensities. Experiments on MEAD, CREMA-D, and HDTF show improved lip synchronization and richer emotional expression over state-of-the-art methods, with thorough ablations validating each component. This work advances expressive, controllable synthetic facial animation for applications in virtual humans, while acknowledging ethical considerations and the need for detection and misuse-prevention measures.

Abstract

Fine-grained emotion control is crucial for emotion generation tasks because it enhances the expressive capability of the generative model, allowing it to capture and express nuanced emotional states accurately and comprehensively, and thereby improves the emotional quality and personalization of the generated content. Generating fine-grained facial animations that accurately portray emotional expressions from only a portrait and an audio recording remains a challenge. To address it, we propose a visual attribute-guided audio decoupler. It yields content vectors that depend solely on the audio content, which stabilizes the subsequent prediction of lip movement coefficients. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module. Additionally, we propose an emotion intensity control method based on a fine-grained emotion matrix. Together, these provide effective control over emotional expression in the generated videos and a finer classification of emotion intensity. A series of 3DMM coefficient generation networks then predicts the 3D coefficients, and a rendering network generates the final video. Our experimental results demonstrate that the proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization. Project page: https://peterfanfan.github.io/EmoSpeaker/

Paper Structure

This paper contains 26 sections, 9 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Our proposed EmoSpeaker generates talking-face videos from a single image, a driving audio clip, a specified emotion category label, and a fine-grained emotion intensity. The variation of different AU regions across fine-grained intensity levels is marked.
  • Figure 2: We perform multi-level encoding on the driving audio to obtain content vectors. These vectors are then combined with coefficients from the reference image, emotion type labels, and emotion intensity labels to predict the coefficients. Finally, MappingNet maps these coefficients to driving motion coefficients, which are used to warp the reference image and generate the final video. a. Source Coeff Extraction: Extract 68 facial keypoints and 3DMM coefficients from the reference image for training and generation. b. Visual Attribute-Guided Audio Decoupler: Pass the audio through three consecutive audio encoders to obtain separate low-level and high-level audio encodings. A shared AU decoder produces AU-related features, which are compared with AU coefficients extracted from the training videos for contrastive learning. c. Fine-grained Emotion Coefficient Prediction Module: Emotion categories and intensity labels are specified manually. During inference, the sliding window size of the input audio is adjusted to obtain a fine-grained emotion vector synchronized with the audio. This vector is combined with the content vectors to predict expression, emotion, and pose coefficients through ExpNet, EmoNet, and PoseNet. d. Emotion Face Renderer: The predicted 3DMM coefficients are used to generate motion vectors for latent facial keypoints, which animate the facial image.
  • Figure 3: Flowchart of Fine-grained Emotion Coefficient Prediction. Audio sliding windows of varying sizes are used during inference, and a different emotion category and intensity label can be manually assigned to each window. EmoNet predicts the coefficients for each window. The predicted coefficients of the last frame then serve as the reference coefficients for the next window.
  • Figure 4: We compare our method with state-of-the-art emotion-driven facial expression generation methods such as EAMM, EVP, and MEAD, as well as the lip generation method Wav2Lip. The figure shows that our method performs better in lip synchronization, pose reconstruction, and video quality.
  • Figure 5: One-shot comparative results on the CREMA-D dataset.
  • ...and 4 more figures
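
The window-by-window inference described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Segment`, `predict_sequence`, and `toy_emonet` are hypothetical names, audio features are reduced to one float per frame, and the toy predictor is only a placeholder for EmoNet. The one property taken from the caption is that the last predicted frame of a window becomes the reference for the next window.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Coeff = List[float]
# Hypothetical predictor signature: (window audio, reference coeff, emotion, intensity)
Predictor = Callable[[Sequence[float], Coeff, str, float], List[Coeff]]


@dataclass
class Segment:
    """One audio window with its manually assigned emotion label and intensity."""
    length: int       # number of audio frames in this window
    emotion: str      # emotion category label
    intensity: float  # fine-grained intensity value


def predict_sequence(audio: Sequence[float], segments: Sequence[Segment],
                     init_ref: Coeff, predictor: Predictor) -> List[Coeff]:
    """Predict per-frame coefficients window by window.

    The last frame predicted for one window is used as the reference
    coefficients for the following window.
    """
    if any(seg.length <= 0 for seg in segments):
        raise ValueError("every window must contain at least one frame")
    if sum(seg.length for seg in segments) != len(audio):
        raise ValueError("window lengths must cover the audio exactly")

    ref = list(init_ref)
    output: List[Coeff] = []
    start = 0
    for seg in segments:
        window = audio[start:start + seg.length]
        frames = predictor(window, ref, seg.emotion, seg.intensity)
        output.extend(frames)
        ref = list(frames[-1])  # carry the last frame over as the next reference
        start += seg.length
    return output


def toy_emonet(window: Sequence[float], ref: Coeff,
               emotion: str, intensity: float) -> List[Coeff]:
    # Placeholder only: offsets the reference by intensity * audio feature.
    # It ignores the emotion label; a real predictor would be a trained network.
    return [[r + intensity * a for r in ref] for a in window]


audio = [1.0, 1.0, 2.0, 2.0, 1.0]
segments = [Segment(2, "happy", 0.5), Segment(3, "angry", 1.0)]
coeffs = predict_sequence(audio, segments, [0.0, 0.0], toy_emonet)
```

Because each window has its own length, label, and intensity, the emotion can change mid-utterance, while the carried-over reference keeps the coefficients continuous at window boundaries.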