EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation
Guanwen Feng, Haoran Cheng, Yunan Li, Zhiyuan Ma, Chaoneng Li, Zhihao Qian, Qiguang Miao, Chi-Man Pun
TL;DR
EmoSpeaker tackles one-shot fine-grained emotion-controlled talking-face generation by decoupling audio content from emotion via a Visual Attribute-Guided Audio Decoupler and injecting emotion through a Fine-grained Emotion Coefficient Prediction Module. The system leverages a 3D Morphable Model backbone and a MappingNet-based Emotion Face Renderer to output photorealistic videos with controllable emotional categories and intensities. Key innovations include AU-guided contrastive learning to remove emotional content from audio, and a fine-grained emotion intensity matrix that enables unseen emotional intensities. Experiments on MEAD, CREMA-D, and HDTF show improved lip synchronization and richer emotional expression over state-of-the-art methods, with thorough ablations validating each component. This work advances expressive, controllable synthetic facial animation for applications in virtual humans, while acknowledging ethical considerations and the need for detection and misuse-prevention measures.
Abstract
Fine-grained emotion control is crucial for emotion generation tasks: it strengthens the expressive capability of the generative model, allowing it to capture and express nuanced emotional states accurately and comprehensively, and thereby improves the emotional quality and personalization of the generated content. Generating fine-grained facial animations that accurately portray emotional expressions from only a portrait and an audio recording remains a challenge. To address it, we propose a visual attribute-guided audio decoupler, which yields content vectors related solely to the audio content and thereby stabilizes the subsequent prediction of lip movement coefficients. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module. We additionally propose an emotion intensity control method based on a fine-grained emotion matrix. Together, these components provide effective control over emotional expression in the generated videos and a finer classification of emotion intensity. A series of 3DMM coefficient generation networks then predicts the 3D coefficients, and a rendering network generates the final video. Experimental results demonstrate that the proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization. Project page: https://peterfanfan.github.io/EmoSpeaker/
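To make the idea of intensity control through a fine-grained emotion matrix more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the matrix stores one vector per emotion and per annotated intensity level, and that an intermediate (unseen) intensity can be approximated by linear interpolation between adjacent levels. The function name `interpolate_intensity`, the dictionary `emotion_matrix`, the two-dimensional vectors, and the choice of three levels are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def interpolate_intensity(levels: Sequence[Sequence[float]], intensity: float) -> List[float]:
    """Linearly interpolate between per-level emotion vectors.

    levels: one vector per annotated intensity level, ordered weakest to strongest.
    intensity: value in [0, 1]; 0 maps to the first level, 1 to the last.
    This interpolation rule is an illustrative assumption, not the paper's method.
    """
    if not levels:
        raise ValueError("levels must contain at least one vector")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    if len(levels) == 1:
        return list(levels[0])

    # Map the intensity onto the index axis of the level list.
    pos = intensity * (len(levels) - 1)
    # Clamp so that intensity == 1.0 still has a right-hand neighbour.
    lower = min(int(pos), len(levels) - 2)
    frac = pos - lower

    return [
        (1.0 - frac) * a + frac * b
        for a, b in zip(levels[lower], levels[lower + 1])
    ]


# Hypothetical matrix: emotion name -> vectors for three intensity levels.
emotion_matrix = {
    "happy": [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]],
}

# An intensity halfway between the first and second annotated levels.
between_levels = interpolate_intensity(emotion_matrix["happy"], 0.25)
```

Under these assumptions, any value in [0, 1] selects a point between two annotated levels, which is one simple way a discrete set of intensity labels could be extended to a continuous control.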
