EmoSign: A Multimodal Dataset for Understanding Emotions in American Sign Language

Phoebe Chua, Cathy Mengying Fang, Takehiko Ohkawa, Raja Kushalnagar, Suranga Nanayakkara, Pattie Maes

TL;DR

EmoSign introduces the first multimodal ASL emotion dataset, with sentiment and emotion labels for 200 ASL utterances, annotated by three Deaf ASL signers and complemented by open-ended descriptions of emotion cues. The work establishes three benchmark tasks (sentiment analysis, single-label emotion classification, and emotion cue grounding) and reports baseline results from several multimodal LLMs. These baselines show that emotion recognition from video alone remains difficult and that performance improves when textual captions are provided. The findings highlight gaps in current models' ability to disentangle linguistic from affective signals in sign language, underscoring the need for ASL-specific model adaptations. By providing open access to EmoSign and its prompts, the paper offers a new benchmark and roadmap for advancing multimodal emotion understanding in sign languages and improving accessibility.

Abstract

Unlike spoken languages where the use of prosodic features to convey emotion is well studied, indicators of emotion in sign language remain poorly understood, creating communication barriers in critical settings. Sign languages present unique challenges as facial expressions and hand movements simultaneously serve both grammatical and emotional functions. To address this gap, we introduce EmoSign, the first sign video dataset containing sentiment and emotion labels for 200 American Sign Language (ASL) videos. We also collect open-ended descriptions of emotion cues. Annotations were done by 3 Deaf ASL signers with professional interpretation experience. Alongside the annotations, we include baseline models for sentiment and emotion classification. This dataset not only addresses a critical gap in existing sign language research but also establishes a new benchmark for understanding model capabilities in multimodal emotion recognition for sign languages. The dataset is made available at https://huggingface.co/datasets/catfang/emosign.

Paper Structure

This paper contains 24 sections, 31 figures, 5 tables.

Figures (31)

  • Figure 1: Duration distribution of the clips in the dataset. Dashed line indicates mean.
  • Figure 2: Distribution of sentiment labels. The labels correspond to the 7-point Likert scale where -3 is extremely negative, 0 is neutral, and 3 is extremely positive. Numbers above the bars indicate count.
  • Figure 3: Distribution of emotion categories based on binarized presence across clips.
  • Figure 4: Annotation Interface
  • Figure 5: Jaccard Similarity of the original set of Emotion Labels.
  • ...and 26 more figures
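
The captions for Figures 3 and 5 refer to binarized emotion presence per clip and to Jaccard similarity between emotion labels. As a rough illustration of what such a comparison could look like, the sketch below treats each label as the set of clips in which it is marked present and computes |A ∩ B| / |A ∪ B| for every pair of labels. The clip ids, label names, and helper functions are hypothetical and are not taken from the paper or the dataset schema; the authors' actual procedure may differ.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clips_by_label(clip_labels: dict) -> dict:
    """Invert {clip: labels present} into {label: clips where it is present}."""
    by_label = {}
    for clip, labels in clip_labels.items():
        for label in labels:
            by_label.setdefault(label, set()).add(clip)
    return by_label


def pairwise_label_similarity(clip_labels: dict) -> dict:
    """Jaccard similarity for every unordered pair of labels (names sorted)."""
    by_label = clips_by_label(clip_labels)
    return {
        (x, y): jaccard(by_label[x], by_label[y])
        for x, y in combinations(sorted(by_label), 2)
    }


# Hypothetical toy annotations: clip id -> emotion labels marked present.
toy_clip_labels = {
    "clip_01": {"joy", "excitement"},
    "clip_02": {"joy"},
    "clip_03": {"sadness", "frustration"},
    "clip_04": {"joy", "excitement"},
}

similarity = pairwise_label_similarity(toy_clip_labels)
for pair, score in similarity.items():
    print(pair, round(score, 3))
```

In this toy data, labels that always appear in the same clips score 1.0 and labels that never co-occur score 0.0, which is the kind of overlap a label-similarity figure would make visible.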