SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios
Hazim Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh
TL;DR
This work reframes Speech Emotion Recognition (SER) as conditional text-token generation, addressing poor Out-of-Domain (OOD) generalization by leveraging an audio-conditioned language model (SELM). SELM fuses a frozen Wav2Vec2 encoder with a GPT-2-based LM, using learnable audio and text mappers to produce emotion tokens that are mapped to discrete classes, trained on 315k triplets spanning categorical, sentiment, and dimensional views. Evaluated on three datasets unseen during training (RAVDESS, CREMA-D, IEMOCAP), SELM reports relative accuracy gains of 17% on RAVDESS and 7% on CREMA-D over state-of-the-art baselines, and improves further with Few-Shot Learning. The approach highlights the value of decomposing SER into acoustic modeling plus language-model reweighting, offering practical gains for in-the-wild emotion recognition and adaptable deployment with limited labeled data.
Abstract
Speech Emotion Recognition (SER) has traditionally been formulated as a classification task. However, emotions generally form a spectrum whose distribution varies from situation to situation, which leads to poor Out-of-Domain (OOD) performance. We take inspiration from the statistical formulation of Automatic Speech Recognition (ASR) and formulate the SER task as generating the most likely sequence of text tokens to infer emotion. This formulation breaks SER into acoustic model predictions weighted by language model predictions. As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. We train SELM on a curated speech emotion corpus and test it on three OOD datasets (RAVDESS, CREMA-D, IEMOCAP) not used in training. SELM achieves significant improvements over state-of-the-art baselines, with 17% and 7% relative accuracy gains on RAVDESS and CREMA-D, respectively. Moreover, SELM can further boost its performance through Few-Shot Learning with a few annotated examples. The results highlight the effectiveness of our SER formulation, especially for improving performance in OOD scenarios.
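The decomposition mentioned in the abstract can be sketched by analogy with the standard statistical ASR formulation. The notation below is an assumption introduced for illustration and is not taken from the paper: X denotes the input speech, W a word sequence, and T a sequence of emotion text tokens. The first line is the usual Bayes decomposition used in ASR; the second is one plausible way to write the SER analogue, in which an acoustic term is weighted by a language-model prior over emotion tokens.

```latex
% Statistical ASR: most likely word sequence W given acoustic observations X.
% P(X) does not depend on W, so it drops out of the arg max.
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

% Assumed SER analogue: most likely emotion token sequence T given speech X.
\hat{T} = \arg\max_{T} P(T \mid X)
        = \arg\max_{T} \underbrace{P(X \mid T)}_{\text{acoustic model}} \;
                       \underbrace{P(T)}_{\text{language model}}
```

Under this reading, a shift in the emotion distribution between domains mainly affects the prior P(T), which may help explain why adapting with a few annotated examples can improve OOD performance.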
