SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition

Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem

TL;DR

This work tackles the generalization gap in Speech Emotion Recognition by introducing a large-scale, multilingual benchmark that evaluates models in both in-domain and out-of-domain settings, with a focus on zero-shot transfer. It assesses a broad set of backbones, including SSL models and OpenAI Whisper variants, using a compact MLP classifier and logit-adjusted, frame-level predictions across curated datasets remapped to eight emotion classes. The key finding is that Whisper-based encoders often outperform dedicated SSL SER models in cross-lingual scenarios, challenging assumptions about ASR-focused models for emotion tasks, while also revealing substantial cross-dataset performance variability. The benchmark and open-source resources provided aim to drive development of more robust, generalizable cross-lingual SER systems for real-world deployment.

Abstract

Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER models, and our benchmark serves as a valuable resource to drive future research in this direction.
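
The abstract names logit adjustment as the way class imbalance is handled but does not give a formula. As a rough illustration only, the sketch below shows one common post-hoc form of the technique: subtracting a scaled log class prior from each class logit before taking the argmax. The function names, the `tau` scaling parameter, the additive smoothing, and the choice to average frame-level logits into an utterance-level score are all assumptions made for this example, not details taken from the paper.

```python
import math


def class_log_priors(labels, num_classes, smoothing=1.0):
    """Return log of smoothed empirical class frequencies (hypothetical helper)."""
    counts = [smoothing] * num_classes
    for y in labels:
        counts[y] += 1.0
    total = sum(counts)
    return [math.log(c / total) for c in counts]


def adjust_logits(logits, log_priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log prior from each class logit."""
    return [z - tau * lp for z, lp in zip(logits, log_priors)]


def predict_utterance(frame_logits, log_priors, tau=1.0):
    """Average frame-level logits (an assumption), adjust them, return the argmax class."""
    num_classes = len(log_priors)
    num_frames = len(frame_logits)
    mean_logits = [
        sum(frame[c] for frame in frame_logits) / num_frames
        for c in range(num_classes)
    ]
    adjusted = adjust_logits(mean_logits, log_priors, tau)
    return max(range(num_classes), key=lambda c: adjusted[c])


# Toy training labels: class 0 is nine times more frequent than class 1.
train_labels = [0] * 90 + [1] * 10
log_priors = class_log_priors(train_labels, num_classes=2)

# Two frames of made-up logits that mildly favor the frequent class.
frame_logits = [[1.0, 0.2], [0.8, 0.0]]

print(predict_utterance(frame_logits, log_priors, tau=0.0))  # no adjustment
print(predict_utterance(frame_logits, log_priors, tau=1.0))  # prior-corrected
```

With `tau=0.0` the prediction follows the raw logits and picks the frequent class; with `tau=1.0` the correction for the skewed prior is large enough to flip the decision toward the rare class. Logit adjustment can also be applied inside the training loss rather than at inference, and the source does not say which variant the benchmark uses.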

Paper Structure

This paper contains 15 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Overview of our benchmark's methodology.
  • Figure 2: Average out-of-domain performance given the training dataset.
  • Figure 3: Average performance on individual datasets when not trained on them.