EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Jian Zhang, Weijian Mai, Zhijun Zhang

TL;DR

EMOdiffhead tackles emotion-driven talking head generation with fine-grained, continuous emotion control and one-shot generation. It leverages DECA-derived FLAME expression vectors as explicit emotion conditions, conditioning a diffusion-based video synthesis model on audio and identity features via ReferenceNet. The approach learns rich facial information from emotion-irrelevant data by encoding expressions from FLAME and uses an LSTM-based expression generator with discriminators to produce emotion trajectories. Experimental results on MEAD and HDTF demonstrate state-of-the-art emotion editing accuracy, lip synchronization, and video realism, with strong generalization to unseen identities.

Abstract

The task of audio-driven portrait animation involves generating a talking head video using an identity image and an audio track of speech. While many existing approaches focus on lip synchronization and video quality, few tackle the challenge of generating emotion-driven talking head videos. The ability to control and edit emotions is essential for producing expressive and realistic animations. In response to this challenge, we propose EMOdiffhead, a novel method for emotional talking head video generation that not only enables fine-grained control of emotion categories and intensities but also supports one-shot generation. Given the FLAME 3D model's linearity in expression modeling, we utilize the DECA method to extract expression vectors, which are combined with audio to guide a diffusion model in generating videos with precise lip synchronization and rich emotional expressiveness. This approach not only enables the learning of rich facial information from emotion-irrelevant data but also facilitates the generation of emotional videos. It effectively overcomes the limitations of emotional data, such as the lack of diversity in facial and background information, and addresses the absence of emotional details in emotion-irrelevant data. Extensive experiments and user studies demonstrate that our approach achieves state-of-the-art performance compared to other emotional portrait animation methods.

Paper Structure

This paper contains 24 sections, 14 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Given an emotion label and specified intensity, our method first generates an expression vector. This vector is then combined with the audio and the target identity image to synthesize a video that aligns with the specified emotion and intensity.
  • Figure 2: Pipeline of our method. Given an emotion label with a specified intensity, our method first generates FLAME-based expression vectors. These vectors are then combined with a reference image and audio to condition the diffusion model for generating the target video. (a) Backbone Network: The time-based diffusion model receives image, audio, and expression vector inputs to synthesize the target video. (b) ReferenceNet: Another UNet with a structure similar to the backbone network extracts features from a reference image to maintain identity consistency. (c) Emotion Editing Condition Generation Module: The emotion category and intensity are specified manually. During inference, the editing direction from the neutral emotion to the target non-neutral emotion, together with the intensity value, is used to obtain the final expression vectors.
  • Figure 3: The FLAME model's linear editing characteristics allow for gradual changes in emotion intensity: as the neutral emotion vector shifts toward the non-neutral emotion vector, the intensity of the non-neutral emotion progressively increases. The numbers in the figure represent the intensity values.
  • Figure 4: Comparison of our model with other state-of-the-art models for emotional talking face generation. The emotion intensity for all methods is set to 1 or to the strongest available level.
  • Figure 5: Editing results of different facial expression types and intensities generated by our method.
  • ...and 2 more figures
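
The linear editing described in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. It assumes the edited expression vector is the neutral vector plus the intensity times the editing direction (target minus neutral), which is one straightforward reading of the captions and not necessarily the paper's exact formulation. The function name `edit_expression` and the 3-dimensional toy vectors are hypothetical; real FLAME expression vectors extracted with DECA are higher-dimensional.

```python
from typing import List, Sequence


def edit_expression(
    neutral: Sequence[float],
    target: Sequence[float],
    intensity: float,
) -> List[float]:
    """Move a neutral expression vector toward a target emotion vector.

    Assumed form (not confirmed by the source): neutral + intensity * (target - neutral).
    """
    if len(neutral) != len(target):
        raise ValueError("expression vectors must have the same length")
    # Editing direction: from the neutral emotion to the target non-neutral emotion.
    direction = [t - n for n, t in zip(neutral, target)]
    # Scale the direction by the requested intensity and add it to the neutral vector.
    return [n + intensity * d for n, d in zip(neutral, direction)]


# Toy vectors with made-up values, used only to show the interpolation.
neutral = [0.0, 0.0, 0.0]
happy = [1.0, -0.5, 2.0]

half = edit_expression(neutral, happy, 0.5)  # halfway between neutral and happy
full = edit_expression(neutral, happy, 1.0)  # the target emotion vector itself
```

Under this assumption, an intensity of 0 returns the neutral expression and an intensity of 1 returns the target emotion, matching the gradual progression the Figure 3 caption describes.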