Gender Bias in Emotion Recognition by Large Language Models

Maureen Herbert, Katie Sun, Angelica Lim, Yasaman Etesam

TL;DR

This work investigates gender bias in emotion recognition by large language models using context-rich NarraCap captions derived from EMOTIC, comparing multiple models and debiasing strategies. The authors define an equal-emission baseline and assess bias through chi-square tests and per-emotion distributions, finding that inference-time prompt methods are generally ineffective. Training-based debiasing via data augmentation and fine-tuning (FT1, FT2) substantially reduces detectable bias across emotions, though model- and emotion-specific variability remains. The study highlights practical implications for deploying LLMs in emotion-aware applications and underscores the value of training-time interventions for fairness in emotional theory of mind. It also cautions about limitations related to dataset scope, gender representation, and environmental costs of model training.
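
As a rough illustration of the chi-square assessment mentioned above, the sketch below tests whether one emotion label is predicted at different rates across three gender groups. The contingency-table layout (three gender groups by predicted / not predicted), the function name `chi_square_3x2`, and the counts in the usage note are assumptions made for illustration; they are not taken from the paper and the authors' exact procedure may differ.

```python
import math


def chi_square_3x2(counts, totals):
    """Chi-square test of independence for a single emotion label.

    counts: dict mapping gender group -> captions where the label was predicted
    totals: dict mapping gender group -> total captions in that group
    Returns (statistic, p_value) with 2 degrees of freedom
    (three groups, two outcomes: predicted / not predicted).
    """
    groups = list(counts)
    if len(groups) != 3:
        raise ValueError("this sketch assumes exactly three gender groups")

    grand_total = sum(totals[g] for g in groups)
    predicted_total = sum(counts[g] for g in groups)

    # Degenerate table: the label is never or always predicted, so there is
    # nothing to compare across groups.
    if predicted_total in (0, grand_total):
        return 0.0, 1.0

    stat = 0.0
    for g in groups:
        cells = (
            (counts[g], predicted_total),                       # label predicted
            (totals[g] - counts[g], grand_total - predicted_total),  # not predicted
        )
        for observed, column_total in cells:
            expected = totals[g] * column_total / grand_total
            stat += (observed - expected) ** 2 / expected

    # For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value
```

For example, with made-up counts `{"man": 30, "woman": 10, "undefined": 20}` out of 100 captions per group, the function returns a p-value below 0.05, which under this setup would count as detectable bias for that label; identical rates across groups give a statistic of 0.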

Abstract

The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, "How does this person feel?". Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training-based interventions rather than relying solely on inference-time approaches such as prompt engineering.

Paper Structure

This paper contains 28 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An EMOTIC image with the corresponding NarraCap caption, along with swapped and undefined gender versions. GT represents the ground truth emotion labels chosen by annotators.
  • Figure 2: This figure shows the frequency of emotion labels predicted by GPT-4, GPT-5, Mistral, and Tiny LLaMA for captions with man (blue), woman (orange), and undefined (green) genders. To better illustrate the differences across genders, the predictions were normalized based on each emotion label.
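
The per-emotion normalization described in the Figure 2 caption can be read as dividing each gender group's count for an emotion by that emotion's total across groups. The sketch below follows that reading; it is one plausible interpretation rather than the authors' confirmed procedure, and the function name `normalize_per_emotion` and the input layout are hypothetical.

```python
def normalize_per_emotion(freq):
    """Convert raw prediction counts into per-emotion shares.

    freq: dict mapping emotion label -> dict mapping gender group -> count
    Returns the same structure with counts replaced by shares that sum
    to 1 within each emotion (or all zeros if the emotion was never predicted).
    """
    shares = {}
    for emotion, by_gender in freq.items():
        total = sum(by_gender.values())
        shares[emotion] = {
            gender: (count / total if total else 0.0)
            for gender, count in by_gender.items()
        }
    return shares
```

Under this reading, an unbiased model given equally sized gender groups would show roughly one third per group for every emotion, so departures from that split are easy to see even for rarely predicted labels.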