EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer

TL;DR

Speech emotion recognition (SER) has suffered from small, closed datasets with coarse emotion taxonomies and privacy concerns. EmoNet-Voice introduces EmoNet-Voice Big, a 5,000-hour multilingual synthetic corpus across 4 languages and 11 voices with 40 fine-grained emotions, and EmoNet-Voice Bench, an expert-annotated 12,600-sample benchmark, enabling privacy-preserving, large-scale SER research. Trained on EmoNet-Voice Big, EmpathicInsight-Voice achieves state-of-the-art fine-grained emotion estimation and generalizes to the real-world datasets EmoDB and RAVDESS despite a domain gap. The work shows that high-arousal emotions are more readily detected, while subtle cognitive states remain challenging, and it frames a scalable, ethically sourced paradigm for advancing nuanced emotion AI.

Abstract

Speech emotion recognition (SER) systems are constrained by existing datasets that typically cover only 6-10 basic emotions, lack scale and diversity, and face ethical challenges when collecting sensitive emotional states. We introduce EMONET-VOICE, a comprehensive resource addressing these limitations through two components: (1) EmoNet-Voice Big, a 5,000-hour multilingual pre-training dataset spanning 40 fine-grained emotion categories across 11 voices and 4 languages, and (2) EmoNet-Voice Bench, a rigorously validated benchmark of 4.7k samples with unanimous expert consensus on emotion presence and intensity levels. Using state-of-the-art synthetic voice generation, our privacy-preserving approach enables ethical inclusion of sensitive emotions (e.g., pain, shame) while maintaining controlled experimental conditions. Each sample underwent validation by three psychology experts. We demonstrate that our Empathic Insight models, trained on our synthetic data, generalize well to real-world datasets, as tested on EmoDB and RAVDESS. Furthermore, our comprehensive evaluation reveals that while high-arousal emotions (e.g., anger: 95% accuracy) are readily detected, the benchmark successfully exposes the difficulty of distinguishing perceptually similar emotions (e.g., sadness vs. distress: 63% discrimination), providing quantifiable metrics for advancing nuanced emotion AI. EMONET-VOICE establishes a new paradigm for large-scale, ethically sourced, fine-grained SER research.

Paper Structure

This paper contains 24 sections, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Expert annotator agreement on perceived emotions. Stacked bars show the proportion of audio-emotion instances by agreement type, from unanimous agreement on presence (e.g., '3:0 (+)') to disagreement ('1:1') and unanimous agreement on absence ('3:0 (-)'). Numbers to the right indicate total instances (n) per emotion and the distribution of raters (%2r, %3r). The patterns reveal high consensus for acoustically salient emotions like concentration but significant ambiguity for nuanced states like awe, underscoring the challenge of fine-grained SER.
  • Figure 2: Instructions given to the human annotator for the expert annotation of EmoNet-Voice Bench.
  • Figure 3: UI of our expert annotation tool for EmoNet-Voice Bench.
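
The agreement types named in the Figure 1 caption ('3:0 (+)', '1:1', '3:0 (-)') can be illustrated with a minimal sketch. This is not the authors' code: the function names, the boolean vote encoding, and the labels for non-unanimous majorities such as '2:1 (+)' are assumptions made for illustration only.

```python
from collections import Counter


def agreement_label(votes):
    """Map per-rater presence votes (True = emotion perceived) to a label.

    Assumed label scheme (hypothetical, modeled on the Figure 1 caption):
    majority for presence  -> 'p:a (+)', e.g. '3:0 (+)'
    majority for absence   -> 'a:p (-)', e.g. '3:0 (-)'
    tie                    -> 'k:k',     e.g. '1:1'
    """
    present = sum(1 for v in votes if v)
    absent = len(votes) - present
    if present == absent:
        return f"{present}:{absent}"
    if present > absent:
        return f"{present}:{absent} (+)"
    return f"{absent}:{present} (-)"


def agreement_distribution(instances):
    """Return the share of instances falling under each agreement label."""
    counts = Counter(agreement_label(votes) for votes in instances)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy votes for one emotion: each inner list holds one instance's rater votes.
toy_instances = [
    [True, True, True],     # three raters, all perceive the emotion
    [True, False],          # two raters, split
    [True, True, True],
    [False, False, False],  # three raters, none perceive the emotion
]
distribution = agreement_distribution(toy_instances)
```

Computing such shares per emotion would yield the proportions that a stacked bar chart of this kind displays; the toy votes above are invented and do not reflect the benchmark's data.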