SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition
Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem
TL;DR
This work addresses the generalization gap in speech emotion recognition (SER) by introducing a large-scale, multilingual benchmark that evaluates models in both in-domain and out-of-domain settings, with an emphasis on zero-shot transfer. It evaluates a broad set of backbones, including SSL models and OpenAI Whisper variants, each paired with a compact MLP classifier and logit-adjusted, frame-level predictions, on curated datasets whose labels are remapped to eight emotion classes. The key finding is that Whisper-based encoders often outperform dedicated SSL SER models in cross-lingual scenarios, which challenges the assumption that ASR-focused models are poorly suited to emotion tasks. Performance also varies substantially across datasets. The benchmark and accompanying open-source resources are intended to support the development of more robust, generalizable cross-lingual SER systems for real-world deployment.
Abstract
Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER models, and our benchmark serves as a valuable resource to drive future research in this direction.
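As a rough illustration of the logit adjustment mentioned above, the sketch below applies the common post-hoc form, in which the log of each class's training prior (scaled by a temperature) is subtracted from the classifier's logits so that frequent classes no longer dominate predictions. The abstract does not spell out the exact formulation used in the benchmark, so the function names `class_priors` and `logit_adjust`, the add-one smoothing, the `tau` parameter, and the toy three-class data are all assumptions made for this example.

```python
import numpy as np

def class_priors(labels, num_classes):
    """Estimate class priors from integer training labels.

    Add-one smoothing (an assumption of this sketch) keeps classes
    that never occur in training from producing log(0).
    """
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts += 1.0
    return counts / counts.sum()

def logit_adjust(logits, priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) per class."""
    return logits - tau * np.log(priors)

# Toy, imbalanced training labels over three hypothetical classes.
train_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
priors = class_priors(train_labels, num_classes=3)

# One utterance whose raw logits slightly favour the majority class.
logits = np.array([[2.0, 1.8, 1.7]])
raw_pred = logits.argmax(axis=1)
adj_pred = logit_adjust(logits, priors).argmax(axis=1)
```

In this toy case the raw prediction is the majority class 0, while the adjusted prediction shifts to the rarest class 2, because its small prior yields the largest correction. Setting `tau=0` recovers the unadjusted logits.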
