EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation

Yanchao Zhao, Jihao Zhu, Yu Liu, Weizhuo Chen, Yuling Yang, Kun Peng

TL;DR

This work tackles the lack of natural emotional expressiveness in sign language generation by introducing EASL, a multi-emotion-guided framework that disentangles semantic and emotional representations. The architecture comprises DESE, which separately encodes semantics and emotions, and EGSID, which uses emotion-guided interaction to decode poses and emotion confidence across seven classes. A three-phase progressive training strategy prevents feature entanglement and jointly refines semantic and emotional guidance, achieving superior pose accuracy and expressive realism, with demonstrated compatibility with diffusion-based video generation. The approach is validated on PHOENIX14T and Prompt2Sign, showing clear gains over strong baselines and robust ablation results that confirm the complementary roles of global emotional constraints and fine-grained emotion–semantic interaction.

Abstract

Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition. Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: EASL improves accuracy over baseline methods (bottom left) while applying appropriate emotional expressions to identical semantic content across different contextual scenes (raised), using multi-emotion guidance with confidence scores.
  • Figure 2: EASL architecture with three-phase training strategy.
  • Figure 3: Emotion similarity $\rho(E_t, E_{BERT})$ across epochs.
  • Figure 4: Case: I am very pleased ... to meet you today.