EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
Ziyang Ma, Mingjie Chen, Hezhao Zhang, Zhisheng Zheng, Wenxi Chen, Xiquan Li, Jiaxin Ye, Xie Chen, Thomas Hain
TL;DR
EmoBox addresses the lack of a universal SER benchmark with a multilingual, multi-corpus toolkit and benchmark covering both intra-corpus and cross-corpus evaluation. It standardizes data processing and partitions each dataset with dataset-aware rules. For the cross-corpus setting, it uses emotion2vec to mitigate annotation errors and builds test sets that are balanced across speakers and emotions. The paper benchmarks 10 pre-trained models (including Whisper, WavLM, HuBERT, and data2vec variants) on 32 datasets in 14 languages in the intra-corpus setting and on 4 datasets in the cross-corpus setting, reporting UA, WA, and Macro F1. Whisper large v3 is often the strongest model, and the results highlight the benefit of large, multilingually pre-trained speech models for SER. To the authors' knowledge this is the largest SER benchmark to date. The paper also provides partitioning guidelines to improve reproducibility and a formal definition of cross-corpus accuracy for quantifying generalization across corpora. The work is of practical use to researchers who need robust, cross-language SER evaluation, and it sets the stage for more comprehensive analyses of linguistic and channel robustness in emotion recognition.
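The three metrics named above can be illustrated with a short sketch. It assumes the convention common in SER work, where UA is the unweighted mean of per-class recall and WA is plain overall accuracy; the summary does not spell out the paper's exact definitions, and the function name `ser_metrics` is hypothetical rather than part of the EmoBox toolkit.

```python
from collections import Counter


def ser_metrics(y_true, y_pred):
    """Return (UA, WA, macro_f1) for two parallel lists of emotion labels.

    Assumed definitions (common in SER, not confirmed by the summary):
      UA       = mean of per-class recall
      WA       = overall accuracy
      macro_f1 = mean of per-class F1
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    classes = sorted(set(y_true))      # classes present in the reference labels
    true_count = Counter(y_true)       # reference samples per class
    pred_count = Counter(y_pred)       # predicted samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    recalls, f1s = [], []
    for c in classes:
        recall = hits[c] / true_count[c]
        precision = hits[c] / pred_count[c] if pred_count[c] else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)

    ua = sum(recalls) / len(classes)
    wa = sum(hits.values()) / len(y_true)
    macro_f1 = sum(f1s) / len(classes)
    return ua, wa, macro_f1
```

On an imbalanced label set UA and WA diverge, which is why SER papers usually report both: WA rewards getting the majority emotion right, while UA weights every emotion equally.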
Abstract
Speech emotion recognition (SER) is an important part of human-computer interaction and receives extensive attention from both industry and academia. However, SER research has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, which makes it difficult to compare different models and methods. 2) No commonly used benchmark covers numerous corpora and languages for researchers to refer to, which makes reproduction a burden. In this paper, we propose EmoBox, an out-of-the-box multilingual multi-corpus speech emotion recognition toolkit, along with a benchmark for both intra-corpus and cross-corpus settings. For intra-corpus settings, we carefully design the data partitioning for different datasets. For cross-corpus settings, we employ a foundation SER model, emotion2vec, to mitigate annotation errors and obtain a test set that is fully balanced in speaker and emotion distributions. Based on EmoBox, we present the intra-corpus SER results of 10 pre-trained speech models on 32 emotion datasets covering 14 languages, and the cross-corpus SER results on 4 datasets with the fully balanced test sets. To the best of our knowledge, this is the largest SER benchmark in terms of both language coverage and data scale. We hope that our toolkit and benchmark can facilitate SER research in the community.
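The abstract does not give the exact rule used to build the balanced cross-corpus test sets, so the following is only a rough sketch of one way such a set could be assembled. It assumes that an utterance is kept when a model's prediction agrees with the human label, and that the same number of utterances is then drawn for every speaker-emotion pair. The sample schema (`speaker`, `emotion` keys), the `predict` callable standing in for a SER model such as emotion2vec, and the function name are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_test_set(samples, predict, seed=0):
    """Pick an equal number of utterances for every (speaker, emotion) pair.

    samples: list of dicts with 'speaker' and 'emotion' keys (assumed schema).
    predict: callable mapping a sample to a predicted emotion label; a
             stand-in for a SER model, not an actual emotion2vec API.
    """
    # Keep only utterances where the model agrees with the annotation
    # (assumed filtering rule for mitigating annotation errors).
    cells = defaultdict(list)
    for s in samples:
        if predict(s) == s["emotion"]:
            cells[(s["speaker"], s["emotion"])].append(s)

    speakers = sorted({s["speaker"] for s in samples})
    emotions = sorted({s["emotion"] for s in samples})

    # The smallest cell bounds how many utterances each cell can contribute.
    n = min(len(cells[(sp, em)]) for sp in speakers for em in emotions)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    selected = []
    for sp in speakers:
        for em in emotions:
            selected.extend(rng.sample(cells[(sp, em)], n))
    return selected
```

If any speaker-emotion pair has no agreeing utterance, this sketch returns an empty list, which signals that full balance is not achievable without dropping that speaker or emotion.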
