EmoCAST: Emotional Talking Portrait via Emotive Text Description
Yiguo Jiang, Xiaodong Cun, Yong Zhang, Yudian Zheng, Fan Tang, Chi-Man Pun
TL;DR
EmoCAST targets emotionally expressive talking-head video generation with precise text-based control in real-world settings. It introduces a diffusion-based pipeline with two attention modules, text-guided emotive attention and emotive audio attention, together with the in-the-wild ETTH dataset and two training strategies (emotion-aware sampling and progressive functional training) that improve expression fidelity and lip-sync. The approach reports state-of-the-art emotion accuracy and robust audio-visual synchronization on both MEAD and in-the-wild evaluations, corroborated by a user study. By combining text-driven control, emotion-aware training, and large-scale in-the-wild data, the work moves controllable emotional portrait synthesis closer to practical use in digital humans and interactive media.
Abstract
Emotional talking head synthesis aims to generate talking portrait videos with vivid expressions. Existing methods still exhibit limitations in control flexibility, motion naturalness, and expression quality. Moreover, currently available datasets are mainly collected in lab settings, further exacerbating these shortcomings and hindering real-world deployment. To address these challenges, we propose EmoCAST, a diffusion-based talking head framework for precise, text-driven emotional synthesis. Its contributions are threefold: (1) architectural modules that enable effective text control; (2) an emotional talking-head dataset that expands the framework's ability; and (3) training strategies that further improve performance. Specifically, for appearance modeling, emotional prompts are integrated through a text-guided emotive attention module, enhancing spatial knowledge to improve emotion understanding. To strengthen audio-emotion alignment, we introduce an emotive audio attention module to capture the interplay between controlled emotion and driving audio, generating emotion-aware features to guide precise facial motion synthesis. Additionally, we construct a large-scale, in-the-wild emotional talking head dataset with emotive text descriptions to optimize the framework's performance. Based on this dataset, we propose an emotion-aware sampling strategy and a progressive functional training strategy that improve the model's ability to capture nuanced expressive features and achieve accurate lip-sync. Overall, EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
Project Page: https://github.com/GVCLab/EmoCAST
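The abstract does not spell out how the text-guided emotive attention module is built. As a rough illustration only, the sketch below assumes it resembles standard scaled dot-product cross-attention, in which spatial (visual) tokens act as queries and emotion-prompt text tokens act as keys and values. The function names, the absence of learned projections, and the residual addition are all assumptions made for this example, not details taken from EmoCAST.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (hypothetical sketch).

    queries: visual tokens, each a list of d floats
    keys, values: text tokens; keys have d floats each
    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this visual token to every text token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of text values: the text-conditioned feature.
        out.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return out


def inject_text(visual_tokens, text_tokens):
    """Add the attended text features back onto the visual tokens (assumed residual)."""
    attended = cross_attention(visual_tokens, text_tokens, text_tokens)
    return [
        [x + a for x, a in zip(tok, att)]
        for tok, att in zip(visual_tokens, attended)
    ]
```

In this reading, each spatial location receives a mixture of prompt embeddings weighted by how strongly it matches them. The same pattern could in principle be reused for audio features, but how the emotive audio attention module actually combines emotion and audio is not described in this section.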
