When Words Smile: Generating Diverse Emotional Facial Expressions from Text

Haidong Xu, Meishan Zhang, Hao Ju, Zhedong Zheng, Erik Cambria, Min Zhang, Hao Fei

TL;DR

This work tackles the challenge of generating diverse, emotionally coherent facial expressions directly from text for digital humans by proposing CTEG, an end-to-end Continuous Text-to-Expression Generator. CTEG employs an Expression-wise Attention encoder and a Conditional Variational Autoregressive Decoder with Latent Temporal Attention to model a continuous latent space of expressions, enabling fluid, contextually appropriate sequences of 3D facial expressions via FLAME coefficients ($d=53$). The EmoAva dataset of 15,000 text–3D expression pairs provides a large, high-quality foundation for training and evaluation, with extensive metrics (diversity, multimodality, variation, fine-grained diversity, and continuous perplexity) and human assessments confirming improved emotion-content consistency over baselines like LM-Listener. Ablation studies highlight the importance of EwA, LTA, and the target-guided loss in maintaining diversity and emotional alignment. The approach advances realistic, emotionally aware digital humans and highlights future directions in multilingual data, personalization, and robust handling of emotional uncertainty.
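
To make the idea of sampling diverse expression sequences from a continuous latent space concrete, here is a toy sketch. It is not the authors' CTEG implementation: it has no attention modules and no learned prior. The only detail taken from the summary above is the frame dimension of 53. The names `make_toy_decoder`, `generate`, `text_bias`, `LATENT_DIM`, and `smooth` are hypothetical stand-ins invented for this illustration.

```python
import random

EXPR_DIM = 53   # FLAME expression coefficients per frame, as stated in the summary
LATENT_DIM = 8  # hypothetical latent size, chosen only for this toy example


def make_toy_decoder(seed=0):
    """Return fixed random weights standing in for a trained decoder (hypothetical)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 0.1) for _ in range(LATENT_DIM)] for _ in range(EXPR_DIM)]


def generate(text_bias, w_z, num_frames, rng, smooth=0.8):
    """Autoregressively sample a sequence of 53-dimensional expression vectors.

    text_bias is a scalar stand-in for text conditioning (an assumption of this toy).
    Each step draws a latent z_t from a standard normal, maps it to a proposal frame,
    and blends the proposal with the previous frame so the sequence changes gradually.
    """
    prev = [text_bias] * EXPR_DIM
    frames = []
    for _ in range(num_frames):
        z = [rng.gauss(0, 1) for _ in range(LATENT_DIM)]
        proposal = [text_bias + sum(w * zi for w, zi in zip(row, z)) for row in w_z]
        prev = [smooth * p + (1 - smooth) * q for p, q in zip(prev, proposal)]
        frames.append(prev)
    return frames


w_z = make_toy_decoder()
# Same "text" condition, different latent draws: two different sequences.
seq_a = generate(0.5, w_z, num_frames=10, rng=random.Random(1))
seq_b = generate(0.5, w_z, num_frames=10, rng=random.Random(2))
```

Because the latent is resampled at every step, the same condition yields a different sequence on each run, while the dependence on the previous frame keeps consecutive frames close to each other. A trained model would replace the fixed weights and the scalar condition with learned networks and a text encoder.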

Abstract

Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text-3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method significantly outperforms baselines across multiple evaluation metrics, marking a significant advancement in the field.

Paper Structure

This paper contains 65 sections, 18 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Top: The existing pipeline for synthesizing emotional avatars, which can only generate a limited set of expressions that lack diversity. Bottom: The proposed end-to-end system, which directly maps text to facial expressions (codes) and aims to generate diverse, emotionally consistent, and temporally smooth expressions.
  • Figure 2: Architecture of the Continuous Text-to-Expression Generator (CTEG). Given a text, the model autoregressively generates a sequence of expression vectors. The green block and the pink block represent the proposed Expression-wise Attention (EwA) module and the core Conditional Variational Autoregressive Decoder (CVAD) module, respectively.
  • Figure 3: Samples from EmoAva dataset. Each instance includes a textual dialogue spoken by an actor, a corresponding head video, and a sequence of 3D expression vectors (here visualized in 3D mesh).
  • Figure 4: A quantitative evaluation of user preferences regarding emotion-content consistency. The color bar from blue to red indicates preference levels from lowest to highest. Expressions from CTEG better match text emotions than those from baselines.
  • Figure 5: The effect of the $\mathcal{L}_{g}$ loss (Eq. \ref{eq:target_loss}) on the KL term in Eq. \ref{loss_cvad}. The $\mathcal{L}_{g}$ loss mitigates the rapid decrease of the KL term and prevents it from approaching zero.
  • ...and 8 more figures