EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition

Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Krishna Kalyan, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer

TL;DR

EmoNet-Face tackles the limited emotional repertoire and demographic biases of existing benchmarks by introducing a fine-grained 40-category emotion taxonomy grounded in established psychology. It delivers three synthetic, expert-annotated datasets (Big for pretraining, Binary for fine-tuning, and HQ for evaluation) with controlled demographic balance across ethnicity, age, and gender. The authors train EmpathicInsight-Face, a specialized model that achieves near-human performance on the HQ benchmark, and show that general-purpose VLMs fall significantly short on nuanced facial emotion recognition. By openly releasing the taxonomy, datasets, and models, the work provides a foundation for advancing emotion-aware AI while emphasizing ethical considerations and the need for multimodal integration in future research.

Abstract

Effective human-AI interaction relies on AI's ability to accurately perceive and interpret human emotions. Current benchmarks for vision and vision-language models are severely limited, offering a narrow emotional spectrum that overlooks nuanced states (e.g., bitterness, intoxication) and fails to distinguish subtle differences between related feelings (e.g., shame vs. embarrassment). Existing datasets also often use uncontrolled imagery with occluded faces and lack demographic diversity, risking significant bias. To address these critical gaps, we introduce EmoNet-Face, a comprehensive benchmark suite. EmoNet-Face features: (1) A novel 40-category emotion taxonomy, meticulously derived from foundational research to capture finer details of human emotional experiences. (2) Three large-scale, AI-generated datasets (EmoNet-Face HQ, Binary, and Big) with explicit, full-face expressions and controlled demographic balance across ethnicity, age, and gender. (3) Rigorous, multi-expert annotations for training and high-fidelity evaluation. (4) EmpathicInsight-Face, a model achieving human-expert-level performance on our benchmark. The publicly released EmoNet-Face suite - taxonomy, datasets, and model - provides a robust foundation for developing and evaluating AI systems with a deeper understanding of human emotions.

Paper Structure

This paper contains 38 sections, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Samples from our EmoNet-Face datasets generated with different state-of-the-art (SOTA) text-to-image (T2I) models.
  • Figure 2: Approximate world map of demographic coverage and diversity in web-scraped datasets (left) compared to EmoNet-Face (right), which is much more diverse.
  • Figure 3: Weighted Kappa ($\kappa_w$) agreement scores by annotator group. A (top): Pairwise agreement between human annotators. B (bottom): Pairwise agreement between each human annotator and other sources, including 'Our Models' (EmpathicInsight-Face), 'Proprietary Models' (HumeFace), 'VLMs (Multi-Shot and Zero-Shot Prompts)', and a 'Random Baseline'. Each box represents the interquartile range (IQR) of $\kappa_w$ scores, with the median as the center line.
  • Figure 4: Mean Spearman's Rho correlation between various model annotators and human annotations. Human ratings were median-aggregated per emotion before correlation with model ratings. The bar heights represent the mean of these per-emotion Spearman's Rho values, calculated across all emotions for each model. Error bars indicate bootstrap 95% confidence intervals (N=1000 bootstraps) for these means. Model annotator groups, including our trained models (EmpathicInsight-Face), VLMs with multi-shot or zero-shot prompting, proprietary models (HumeFace), and a random baseline, are distinguished by patterns as detailed in the legend.
  • Figure 5: Rather than mapping each face to a single label, we estimate a distribution over plausible emotion categories.
  • ...and 11 more figures
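
The agreement statistics named in the captions for Figures 3 and 4 can be illustrated with a minimal, standard-library sketch. It is not the authors' evaluation code, and several details are assumptions because the captions do not state them: quadratic disagreement weights for $\kappa_w$, integer rating levels `0..n_levels-1`, resampling of the per-emotion rho values for the bootstrap, and the hypothetical data layout `human[emotion][image] -> list of annotator ratings` and `model[emotion][image] -> model rating`.

```python
import random
from statistics import mean, median


def weighted_kappa(a, b, n_levels, power=2):
    """Cohen's weighted kappa for two raters with ratings in 0..n_levels-1.

    power=2 gives quadratic weights (an assumption; power=1 would be linear).
    """
    n = len(a)
    hist_a = [a.count(k) / n for k in range(n_levels)]
    hist_b = [b.count(k) / n for k in range(n_levels)]

    def weight(i, j):
        return (abs(i - j) / (n_levels - 1)) ** power

    observed = sum(weight(i, j) for i, j in zip(a, b)) / n
    expected = sum(
        weight(i, j) * hist_a[i] * hist_b[j]
        for i in range(n_levels)
        for j in range(n_levels)
    )
    if expected == 0:
        return 1.0  # both raters always give the same single level
    return 1 - observed / expected


def rank(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: constant ratings count as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def mean_rho_with_ci(human, model, n_boot=1000, seed=0):
    """Mean per-emotion rho plus a percentile-bootstrap 95% interval."""
    rhos = []
    for emotion, per_image_ratings in human.items():
        # Median-aggregate the human annotators for each image first.
        medians = [median(ratings) for ratings in per_image_ratings]
        rhos.append(spearman_rho(medians, model[emotion]))
    rng = random.Random(seed)
    # Assumption: the bootstrap resamples emotions (per-emotion rho values).
    boot_means = sorted(
        mean(rng.choices(rhos, k=len(rhos))) for _ in range(n_boot)
    )
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return mean(rhos), (lower, upper)
```

Calling `mean_rho_with_ci(human, model)` returns the bar height and error-bar bounds for one model in the style of Figure 4, while `weighted_kappa` applied to every pair of annotators yields the kind of pairwise scores summarized by the box plots in Figure 3.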