Text-Driven Emotionally Continuous Talking Face Generation

Hao Yang, Yanyan Zhao, Tian Zheng, Hongbo Zhang, Bichen Wang, Di Wu, Xing Fu, Xuda Zhi, Yongbo Huang, Hao He

TL;DR

The paper introduces Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), a model that handles dynamic emotional variation through Temporal-Intensive Emotion Fluctuation Modeling: it produces an emotion variation sequence corresponding to the input text and uses that sequence to drive continuous facial expression changes in the synthesized video.

Abstract

Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method's exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.

Paper Structure

This paper contains 29 sections, 9 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comparison between (a) the traditional audio-driven emotional TFG task and (b) the proposed EC-TFG task.
  • Figure 2: Overview of the TIE-TFG model. The framework first uses a large-scale TTS model to generate emotional audio from the text content and the emotion description. Next, the emotion fluctuation prediction (EFP) module infers the emotional fluctuations from the audio and text input, with frame-level facial emotion labels serving as its training targets. Finally, the resulting emotional fluctuation features are incorporated into the talking face generation model via a cross-attention mechanism to produce a video with continuous emotional expression.
  • Figure 3: Video generation results of the proposed approach given different emotion descriptions.
  • Figure 4: Comparison of our method with existing emotion control approaches. The top part illustrates the predicted emotional fluctuations versus the pseudo-emotional labels from the reference video. The bottom part displays the generated results from different emotion control methods.
  • Figure 5: User studies on the quality of generated talking-face results.
  • ...and 1 more figure
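
The Figure 2 caption says that emotional fluctuation features enter the talking face generation model through cross-attention. The sketch below illustrates that general mechanism only, using single-head scaled dot-product cross-attention in NumPy. It is not the authors' implementation: the function name `cross_attention`, the residual connection, the random projection matrices, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(frame_feats, emotion_feats, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, hypothetical helper).

    frame_feats:   (T, d_model) per-frame generator features, used as queries.
    emotion_feats: (S, d_emo) emotion fluctuation features, used as keys/values.
    Returns the emotion-conditioned frame features and the attention weights.
    """
    q = frame_feats @ w_q                      # (T, d_attn)
    k = emotion_feats @ w_k                    # (S, d_attn)
    v = emotion_feats @ w_v                    # (S, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, S) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    # Residual connection (an assumption): add the attended emotion context.
    return frame_feats + weights @ v, weights


# Toy sizes, assumed for illustration: 5 video frames, 8 emotion time steps.
rng = np.random.default_rng(0)
T, S, d_model, d_emo, d_attn = 5, 8, 16, 4, 8
frame_feats = rng.normal(size=(T, d_model))
emotion_feats = rng.normal(size=(S, d_emo))
w_q = rng.normal(size=(d_model, d_attn)) * 0.1
w_k = rng.normal(size=(d_emo, d_attn)) * 0.1
w_v = rng.normal(size=(d_emo, d_model)) * 0.1

out, weights = cross_attention(frame_feats, emotion_feats, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

In this sketch each video frame attends over the whole emotion sequence, so the emotion signal can vary from frame to frame instead of being a single fixed label for the clip.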