When Words Smile: Generating Diverse Emotional Facial Expressions from Text

Haidong Xu; Meishan Zhang; Hao Ju; Zhedong Zheng; Erik Cambria; Min Zhang; Hao Fei

TL;DR

This work addresses the generation of diverse, emotionally coherent facial expressions for digital humans directly from text. It proposes CTEG, an end-to-end Continuous Text-to-Expression Generator that combines an Expression-wise Attention (EwA) encoder with a Conditional Variational Autoregressive Decoder using Latent Temporal Attention (LTA). The model learns a continuous latent space of expressions and produces fluid, contextually appropriate sequences of 3D facial expressions represented as FLAME coefficients ($d=53$). The accompanying EmoAva dataset supplies 15,000 text–3D expression pairs for training and evaluation. Automatic metrics (diversity, multimodality, variation, fine-grained diversity, and continuous perplexity) and human assessments indicate improved emotion-content consistency over baselines such as LM-Listener. Ablation studies show that EwA, LTA, and the target-guided loss each contribute to maintaining diversity and emotional alignment. The work points to multilingual data, personalization, and robust handling of emotional uncertainty as future directions.

Abstract

Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text-3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method significantly outperforms baselines across multiple evaluation metrics.
