Text-Driven Emotionally Continuous Talking Face Generation

Hao Yang; Yanyan Zhao; Tian Zheng; Hongbo Zhang; Bichen Wang; Di Wu; Xing Fu; Xuda Zhi; Yongbo Huang; Hao He

TL;DR

The paper introduces Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), a model that uses Temporal-Intensive Emotion Fluctuation Modeling to produce an emotion variation sequence corresponding to the input text, which then drives continuous facial expression changes in the synthesized video.

Abstract

Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method's exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.

Paper Structure (29 sections, 9 equations, 6 figures, 8 tables)

Introduction
Related Work
Emotional Talking Face Generation
Text-Driven Talking Face Generation
Method
Overview
Emotional Audio Generation
Temporal-Intensive Emotion Fluctuation Modeling
Emotion Fluctuation Guided Visual Synthesis
Training and Inference
Experiment
Experimental Setups
Dataset.
Evaluation Metrics.
Quantitative Results
...and 14 more sections

Figures (6)

Figure 1: Comparison between (a) the traditional audio-driven emotional TFG task and (b) the proposed EC-TFG task.
Figure 2: Overview of the TIE-TFG model. The framework first uses a large-scale TTS model to generate emotional audio from the text content and emotion description. Next, the emotion fluctuation prediction (EFP) module infers the emotional fluctuations from the audio and text input, with frame-level facial emotion labels serving as its training targets. Finally, the resulting emotional fluctuation features are incorporated into the talking face generation model via a cross-attention mechanism to produce a video with continuous emotional expression.
Figure 3: Video generation results of the proposed approach given different emotion descriptions.
Figure 4: Comparison of our method with existing emotion control approaches. The top part illustrates the predicted emotional fluctuations versus the pseudo-emotional labels from the reference video. The bottom part displays the generated results from different emotion control methods.
Figure 5: User studies on the quality of generated talking-face results.
...and 1 more figure
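
The Figure 2 caption says that emotional fluctuation features are injected into the talking face generation model via cross-attention. As a rough illustration only, the sketch below implements plain scaled dot-product cross-attention in pure Python, with per-frame visual features as queries and per-frame emotion features as keys and values. The function names, the toy feature shapes, and the absence of learned projection matrices are all assumptions made for this example; the paper's actual layer design is not given in this summary.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (hypothetical helper, no learned weights).

    queries: visual features, one vector of size d per video frame (assumed)
    keys:    emotion fluctuation features, one vector of size d per step (assumed)
    values:  emotion fluctuation features, one vector per step (assumed)
    Returns one fused vector per query frame.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Similarity of this visual frame to every emotion step.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the emotion value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy inputs: 2 visual frames attending over 3 emotion steps.
visual = [[1.0, 0.0], [0.0, 1.0]]
emotion = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(visual, emotion, emotion)
```

Each output row is a convex combination of the emotion vectors, so a visual frame picks up more of the emotion steps it is most similar to. A real model would normally apply learned query, key, and value projections and multiple heads before this step.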
