EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion
Jian Zhang, Weijian Mai, Zhijun Zhang
TL;DR
EMOdiffhead tackles emotion-driven talking head generation with fine-grained, continuous emotion control and one-shot generation. It uses FLAME expression vectors extracted with DECA as explicit emotion conditions for a diffusion-based video synthesis model, which is also conditioned on audio and on identity features via ReferenceNet. Because expressions are encoded as FLAME vectors, the model can learn rich facial information from emotion-irrelevant data, and an LSTM-based expression generator trained with discriminators produces the emotion trajectories. Experimental results on MEAD and HDTF demonstrate state-of-the-art emotion editing accuracy, lip synchronization, and video realism, with strong generalization to unseen identities.
Abstract
The task of audio-driven portrait animation involves generating a talking head video from an identity image and a speech audio track. While many existing approaches focus on lip synchronization and video quality, few tackle the challenge of generating emotion-driven talking head videos. The ability to control and edit emotions is essential for producing expressive and realistic animations. In response to this challenge, we propose EMOdiffhead, a novel method for emotional talking head video generation that enables fine-grained control of emotion categories and intensities and also supports one-shot generation. Given the FLAME 3D model's linearity in expression modeling, we use the DECA method to extract expression vectors, which are combined with audio to guide a diffusion model in generating videos with precise lip synchronization and rich emotional expressiveness. This approach enables the learning of rich facial information from emotion-irrelevant data while also facilitating the generation of emotional videos. It thereby overcomes the limitations of emotional data, such as the lack of diversity in facial and background information, and addresses the absence of emotional details in emotion-irrelevant data. Extensive experiments and user studies demonstrate that our approach achieves state-of-the-art performance compared to other emotional portrait animation methods.
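
The linearity of FLAME's expression space, which the abstract relies on, suggests one way continuous intensity control could be realized: an intermediate intensity can be obtained by linearly interpolating between a neutral expression vector and a fully emotional one. The following is a minimal sketch of that idea only, not the authors' implementation. The function name `blend_expression`, the three-element toy vectors, and the convention that intensity lies in [0, 1] are all assumptions made for illustration; real expression vectors extracted with DECA would be longer.

```python
from typing import List


def blend_expression(
    neutral: List[float], emotional: List[float], intensity: float
) -> List[float]:
    """Linearly interpolate between two expression vectors.

    Hypothetical helper: intensity 0.0 returns the neutral vector,
    intensity 1.0 returns the fully emotional vector.
    """
    if len(neutral) != len(emotional):
        raise ValueError("expression vectors must have the same length")
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    # Element-wise linear blend: n + t * (e - n)
    return [n + intensity * (e - n) for n, e in zip(neutral, emotional)]


# Toy vectors standing in for real expression coefficients (assumed values).
neutral = [0.0, 0.0, 0.0]
happy = [1.0, -0.5, 2.0]
half_happy = blend_expression(neutral, happy, 0.5)
```

Sweeping `intensity` from 0 to 1 over successive frames would yield a smooth transition from the neutral to the emotional expression, which is the kind of continuous control the abstract describes.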
