EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection
Christoph Schuhmann, Robert Kaczmarczyk, Gollam Rabby, Felix Friedrich, Maurice Kraus, Kourosh Nadi, Huu Nguyen, Kristian Kersting, Sören Auer
TL;DR
Speech emotion recognition (SER) has suffered from small, closed datasets with coarse emotion taxonomies and privacy concerns. EmoNet-Voice introduces EmoNet-Voice Big, a 5,000-hour multilingual synthetic corpus spanning 4 languages, 11 voices, and 40 fine-grained emotions, and EmoNet-Voice Bench, an expert-annotated 12,600-sample benchmark, enabling privacy-preserving, large-scale SER research. Trained on EmoNet-Voice Big, EmpathicInsight-Voice achieves state-of-the-art fine-grained emotion estimation and generalizes to the real-world EmoDB and RAVDESS datasets despite the synthetic-to-real domain gap. The evaluation shows that high-arousal emotions are more readily detected, while subtle cognitive states remain challenging, and the work proposes a scalable, ethically sourced paradigm for advancing nuanced emotion AI.
Abstract
Speech emotion recognition (SER) systems are constrained by existing datasets that typically cover only 6-10 basic emotions, lack scale and diversity, and face ethical challenges when collecting sensitive emotional states. We introduce EmoNet-Voice, a comprehensive resource addressing these limitations through two components: (1) EmoNet-Voice Big, a 5,000-hour multilingual pre-training dataset spanning 40 fine-grained emotion categories across 11 voices and 4 languages, and (2) EmoNet-Voice Bench, a rigorously validated benchmark of 4.7k samples with unanimous expert consensus on emotion presence and intensity levels. Using state-of-the-art synthetic voice generation, our privacy-preserving approach enables ethical inclusion of sensitive emotions (e.g., pain, shame) while maintaining controlled experimental conditions. Each sample underwent validation by three psychology experts. We demonstrate that our EmpathicInsight-Voice models, trained on our synthetic data, generalize well to real-world datasets, as tested on EmoDB and RAVDESS. Furthermore, our comprehensive evaluation reveals that while high-arousal emotions (e.g., anger: 95% accuracy) are readily detected, the benchmark successfully exposes the difficulty of distinguishing perceptually similar emotions (e.g., sadness vs. distress: 63% discrimination), providing quantifiable metrics for advancing nuanced emotion AI. EmoNet-Voice establishes a new paradigm for large-scale, ethically sourced, fine-grained SER research.
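The unanimous-consensus criterion described in the abstract can be illustrated with a minimal sketch. This is not the authors' annotation pipeline: the `Rating` record, its field names, and the 0/1/2 intensity scale are hypothetical and chosen only for illustration. The sketch keeps a sample only if all three experts rated it and gave the same intensity.

```python
from collections import defaultdict
from typing import NamedTuple


class Rating(NamedTuple):
    """One expert's judgment of one clip (hypothetical schema)."""
    sample_id: str
    annotator: str
    intensity: int  # assumed scale: 0 = absent, 1 = weak, 2 = strong


def unanimous_samples(ratings, n_annotators=3):
    """Return {sample_id: intensity} for clips where every expert agrees."""
    by_sample = defaultdict(dict)
    for r in ratings:
        # Key by annotator so a repeated rating overwrites instead of double counting.
        by_sample[r.sample_id][r.annotator] = r.intensity

    consensus = {}
    for sample_id, votes in by_sample.items():
        values = set(votes.values())
        # Require a full panel and a single distinct intensity value.
        if len(votes) == n_annotators and len(values) == 1:
            consensus[sample_id] = values.pop()
    return consensus


ratings = [
    Rating("clip_a", "e1", 2), Rating("clip_a", "e2", 2), Rating("clip_a", "e3", 2),
    Rating("clip_b", "e1", 0), Rating("clip_b", "e2", 1), Rating("clip_b", "e3", 0),
    Rating("clip_c", "e1", 1), Rating("clip_c", "e2", 1),  # only two experts
]
print(unanimous_samples(ratings))  # {'clip_a': 2}
```

Under these assumptions, `clip_b` is dropped because the experts disagree and `clip_c` is dropped because the panel is incomplete, which is one way a larger annotated pool can shrink to a smaller unanimous subset.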
