EmoCAST: Emotional Talking Portrait via Emotive Text Description

Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun

TL;DR

EmoCAST addresses the challenge of producing emotionally expressive talking head videos with precise text-based control in real-world settings. It introduces a diffusion-based pipeline with two specialized attention modules—text-guided emotive attention and emotive audio attention—coupled with the ETTH in-the-wild dataset and two training strategies to improve expression fidelity and lip-sync. The approach achieves state-of-the-art emotion accuracy and robust audio-visual synchronization on both MEAD and in-the-wild evaluations, corroborated by a user study. This work advances practical, controllable emotional portrait synthesis with scalable data and training paradigms for real-world applications. The combination of text-driven control, emotion-aware training, and large-scale in-the-wild data positions EmoCAST for broad deployment in digital humans and interactive media.

Abstract

Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, further exacerbating these shortcomings and hindering real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that expands the framework's ability; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, enhancing spatial knowledge to improve emotion understanding. To strengthen audio-emotion alignment, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide precise facial motion synthesis. Additionally, we construct a large-scale, in-the-wild emotional talking head dataset with emotive text descriptions to optimize the framework's performance. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy that improve the model's ability to capture nuanced expressive features and achieve accurate lip-sync. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos. Project Page: https://github.com/GVCLab/EmoCAST

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: We introduce EmoCAST, a novel diffusion-based emotional talking head system for in-the-wild images that incorporates flexible and customizable emotive text prompts. Compared with the previous state-of-the-art text-controlled emotional portrait animation method, i.e., InstructAvatar [wang2024instructavatar], EmoCAST produces more vivid and accurate facial expressions with better identity preservation.
  • Figure 2: The main framework of the proposed EmoCAST, which has two pivotal modules designed for precise emotional synthesis: the text-guided emotive attention module and the emotive audio attention module. The text-guided emotive attention module ensures an accurate alignment between the generated facial expressions and the corresponding textual inputs. Concurrently, the emotive audio attention module facilitates the synthesis of facial motions that precisely reflect the emotional subtleties embedded in the driving speech.
  • Figure 3: Visual illustration of the two proposed training strategies. (a) Emotion-aware Sampling trains on image pairs of neutral and emotional expressions to capture expression-specific features. (b) Progressive Functional Training improves the model's generalization capability, expression accuracy, and lip-synchronization in a phased, coarse-to-fine manner.
  • Figure 4: Visual comparison with other state-of-the-art methods for emotional talking video portraits on in-the-wild images. Our method consistently produces accurate facial expressions while maintaining precise lip synchronization that closely matches the ground truth mouth, along with robust identity preservation. For a more detailed examination, kindly enlarge the image or view the supplemental video.
  • Figure 5: Qualitative results of the ablation study for each design of our method. The emotion category is angry.
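
The pairing idea behind Emotion-aware Sampling (Figure 3a) can be illustrated with a small sketch. This is not the authors' implementation: the record fields `id`, `identity`, and `emotion`, the `"neutral"` label, the helper names, and the choice to pair clips of the same identity are all assumptions made for illustration. The captions only state that neutral and emotional images are trained as pairs.

```python
import random
from collections import defaultdict


def build_index(clips):
    """Group clip ids by identity, then by emotion label (hypothetical schema)."""
    index = defaultdict(lambda: defaultdict(list))
    for clip in clips:
        index[clip["identity"]][clip["emotion"]].append(clip["id"])
    return index


def sample_pair(index, rng):
    """Pick one (neutral_id, emotional_id, emotion) triple for a single identity.

    Assumption: a pair shares the same identity, so the only difference
    between the two samples is the expression.
    """
    # Keep only identities that have a neutral clip and at least one other emotion.
    eligible = [
        ident
        for ident, by_emotion in index.items()
        if by_emotion.get("neutral") and any(e != "neutral" for e in by_emotion)
    ]
    if not eligible:
        raise ValueError("no identity has both neutral and emotional clips")
    ident = rng.choice(eligible)
    by_emotion = index[ident]
    # Sort the labels so the draw depends only on the seed, not on insertion order.
    emotion = rng.choice(sorted(e for e in by_emotion if e != "neutral"))
    neutral_id = rng.choice(by_emotion["neutral"])
    emotional_id = rng.choice(by_emotion[emotion])
    return neutral_id, emotional_id, emotion
```

A training loop would call `sample_pair` once per step with a seeded `random.Random` instance and load the two frames or clips it names. Identities that lack either a neutral or an emotional sample are skipped in this sketch.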