EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

Bingyuan Zhang; Xulong Zhang; Ning Cheng; Jun Yu; Jing Xiao; Jianzong Wang

TL;DR

EmoTalker tackles the challenge of generating emotionally expressive talking faces with strong identity preservation and flexible multi-emotion editing. It introduces a diffusion-based conditional framework where an Emotion Intensity Block maps textual prompts to nuanced emotional embeddings and a cross-attention conditioned generator steers the denoising process, while preserving identity during inference through latent-space constraints involving $Z_T$ and $\hat{z}_t$. A new FED dataset supports learning nuanced emotional descriptions, enabling complex emotion guidance from prompts. Experimental results on MEAD and CREMA-D show competitive identity preservation and superior emotion accuracy, including expressions from prompts with mixed emotions. This work enables fine-grained, controllable emotional expressions for avatar-like agents with practical impact on expressive avatars and human–computer interaction.
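The cross-attention conditioning mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the generic scaled dot-product form, omits the learned query/key/value projections a real model would have, and the names `cross_attention`, `queries`, `keys`, and `values` are hypothetical. Latent tokens from the noisy image act as queries, and condition tokens (for example, emotion embeddings) act as keys and values.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: latent token vectors (assumed to come from the noisy latent)
    keys, values: condition token vectors (assumed emotion/prompt embeddings)
    Returns one output vector per query, each a weighted mix of `values`.
    """
    d = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity between this latent token and every condition token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the condition values.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


if __name__ == "__main__":
    latent_tokens = [[0.0, 0.0], [1.0, -1.0]]
    condition_tokens = [[1.0, 0.0], [0.0, 1.0]]
    print(cross_attention(latent_tokens, condition_tokens, condition_tokens))
```

Each output row is a convex combination of the condition values, which is how the condition can steer the features that the denoiser processes at every step.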

Abstract

In recent years, talking face generation has attracted considerable attention, and some methods can generate virtual faces that convincingly imitate human expressions. However, existing methods generalize poorly, particularly to challenging identities. Furthermore, expression-editing methods are often confined to a single emotion and fail to adapt to complex emotions. To overcome these challenges, this paper proposes EmoTalker, an emotionally editable portrait animation approach based on the diffusion model. EmoTalker modifies the denoising process to preserve the original portrait's identity during inference. To enhance emotion comprehension from text input, an Emotion Intensity Block is introduced to analyze the fine-grained emotions and their strengths derived from prompts. Additionally, a crafted dataset is used to improve emotion comprehension within prompts. Experiments show the effectiveness of EmoTalker in generating high-quality, emotionally customizable facial expressions.

Paper Structure (12 sections, 6 equations, 3 figures, 2 tables)

Introduction
Method
Diffusion-based Reconstruction Module
Aligned Multi-modal Condition Generator
Loss Function and Inference
Experiment
Experiment Dataset and Evaluation Metrics
Comparison with Methods of the SOTA
Expressions Generation via Prompts Containing Complex Emotions
Ablation Study
Conclusion
Acknowledgement

Figures (3)

Figure 1: The overall framework of EmoTalker. The condition generator produces conditions that guide the denoising process, and the emotions of the generated images are driven toward those described by the prompts.
Figure 2: Expressions generation with different prompts.
Figure 3: Expressions generation with different strengths.
