What Does it Take to Generalize SER Model Across Datasets? A Comprehensive Benchmark
Adham Ibrahim, Shady Shehata, Ajinkya Kulkarni, Mukhtar Mohamed, Muhammad Abdul-Mageed
TL;DR
This paper tackles cross-dataset generalization in speech emotion recognition (SER) by benchmarking a Whisper-based SER system across 11 datasets under leave-one-speaker-out evaluation. It combines diverse English datasets, addresses severe class imbalance with SMOTE and ADASYN oversampling, and explores 4-emotion and 5-emotion label configurations. The approach feeds Whisper latent features to a five-layer feed-forward classifier trained with cross-entropy loss, with the aim of speaker-independent generalization. Results indicate that combining datasets and applying targeted oversampling improve cross-dataset performance, though the gains vary by dataset. The work offers a practical pathway toward more robust, real-world SER systems and highlights directions for low-resource language adaptation.
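To make the oversampling step concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function name `smote_like_oversample`, the parameter defaults, and the toy labels are assumptions chosen for illustration, and ADASYN's density-weighted sampling is not covered. The idea is that each synthetic sample is placed on the line segment between a minority-class point and one of its nearest same-class neighbours.

```python
import math
import random
from collections import Counter


def smote_like_oversample(features, labels, k=5, seed=0):
    """Balance classes by interpolating between same-class neighbours.

    Hypothetical helper for illustration; a SMOTE-style sketch, not a
    drop-in replacement for a library implementation.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # bring every class up to the majority size

    new_features = [list(x) for x in features]
    new_labels = list(labels)

    for cls, n in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        if len(pool) < 2:
            continue  # at least one neighbour is needed to interpolate
        for _ in range(target - n):
            base = rng.choice(pool)
            # k nearest same-class neighbours of the base point, excluding itself
            neighbours = sorted(
                (p for p in pool if p is not base),
                key=lambda p: math.dist(p, base),
            )[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            new_features.append([b + gap * (m - b) for b, m in zip(base, mate)])
            new_labels.append(cls)

    return new_features, new_labels


# Toy 2-D "features" standing in for Whisper latent vectors (assumed data).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = ["neutral"] * 4 + ["angry"] * 2
X_bal, y_bal = smote_like_oversample(X, y, k=1)
```

In a speaker-independent setup, oversampling of this kind would normally be applied to the training portion only, so that synthetic points derived from a held-out speaker do not leak into training.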
Abstract
Speech emotion recognition (SER) is essential for enhancing human-computer interaction in speech-based applications. Despite improvements on individual emotion datasets, there is still a research gap in SER's ability to generalize to real-world situations. In this paper, we investigate approaches for generalizing an SER system across different emotion datasets. In particular, we incorporate 11 emotional speech datasets and present a comprehensive benchmark on the SER task. We also address the challenge of imbalanced data distribution by using over-sampling methods when combining SER datasets for training. Furthermore, we explore various evaluation protocols to assess how well SER generalizes. Building on this, we explore the potential of Whisper for SER, emphasizing the importance of thorough evaluation. Our approach is designed to advance SER technology by integrating speaker-independent methods.
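One speaker-independent protocol is leave-one-speaker-out evaluation, in which every utterance from a single speaker is held out for testing while the rest are used for training. The sketch below shows how such folds might be generated. The function name `leave_one_speaker_out` and the `dataset/speaker` ID format are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict


def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices) per speaker.

    Hypothetical helper for illustration. speaker_ids[i] is the speaker of
    utterance i.
    """
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)

    for spk in sorted(by_speaker):
        test_idx = by_speaker[spk]
        train_idx = [i for i, s in enumerate(speaker_ids) if s != spk]
        yield spk, train_idx, test_idx


# Assumed ID scheme: prefixing with a dataset name keeps speakers from
# different corpora distinct when datasets are combined.
speakers = ["a/s1", "a/s1", "a/s2", "b/s1", "b/s1"]
folds = list(leave_one_speaker_out(speakers))
```

Prefixing speaker IDs with a dataset name is a design choice that matters when corpora are merged: two datasets may both contain a speaker labelled `s1`, and without the prefix those utterances would be treated as one speaker.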
