SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios

Hazim Bukhari, Soham Deshmukh, Hira Dhamyal, Bhiksha Raj, Rita Singh

TL;DR

This work reframes Speech Emotion Recognition (SER) as conditional text-token generation, addressing poor Out-of-Domain (OOD) generalization by leveraging an audio-conditioned language model (SELM). SELM fuses a frozen Wav2Vec2 encoder with a GPT-2-based LM, using learnable audio and text mappers to produce emotion tokens that are mapped to discrete classes, trained on 315k triplets spanning categorical, sentiment, and dimensional views. Evaluated on three unseen datasets (RAVDESS, CREMA-D, IEMOCAP), SELM achieves significant improvements in OOD settings and benefits from Few-Shot Learning, demonstrating robust cross-domain emotion understanding. The approach highlights the value of decomposing SER into acoustic modeling plus language-model reweighting, offering practical gains for in-the-wild emotion recognition and adaptable deployment with limited labeled data.
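The step in which generated emotion tokens are mapped to discrete classes can be illustrated with a minimal sketch. The label set, the function name `map_to_class`, and the whole-word matching rule below are assumptions made for illustration; the summary does not specify how the mapping is actually done.

```python
import re

# Hypothetical label set; each evaluation dataset defines its own classes.
EMOTION_CLASSES = ["angry", "happy", "sad", "neutral"]


def map_to_class(generated_text, classes=EMOTION_CLASSES):
    """Return the first class name that appears as a whole word in the text.

    Returns None when no class name is found, so the caller can decide how
    to handle a generation that cannot be parsed.
    """
    # Lowercase and split into alphabetic words so "Happy." matches "happy"
    # while "unhappy" does not.
    words = re.findall(r"[a-z]+", generated_text.lower())
    for word in words:
        if word in classes:
            return word
    return None


print(map_to_class("emotion of happy"))
```

Matching whole words rather than substrings is a deliberate choice in this sketch: it avoids reading "unhappy" as "happy", at the cost of returning None for any wording outside the label set.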

Abstract

Speech Emotion Recognition (SER) has traditionally been formulated as a classification task. However, emotions generally form a spectrum whose distribution varies from situation to situation, leading to poor Out-of-Domain (OOD) performance. We take inspiration from the statistical formulation of Automatic Speech Recognition (ASR) and formulate the SER task as generating the most likely sequence of text tokens to infer emotion. The formulation breaks SER into predicting acoustic model features weighted by language model prediction. As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. We train SELM on a curated speech emotion corpus and test it on three OOD datasets (RAVDESS, CREMA-D, IEMOCAP) not used in training. SELM achieves significant improvements over the state-of-the-art baselines, with 17% and 7% relative accuracy gains for RAVDESS and CREMA-D, respectively. Moreover, SELM can further boost its performance through Few-Shot Learning with a few annotated examples. The results highlight the effectiveness of our SER formulation, especially for improving performance in OOD scenarios.
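The ASR-inspired formulation mentioned in the abstract can be written out as a short sketch. The first line is the standard statistical ASR decomposition; the SER lines apply the same Bayes-rule step to an emotion token sequence. The symbols X (audio), W (word sequence), Y (emotion text tokens), and c (text prompt) are notation assumed here and may differ from the paper's own equations.

```latex
% Standard statistical ASR: most likely word sequence W given audio X
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)

% Analogous SER view (assumed notation): Y = (y_1, \dots, y_n) are text tokens describing emotion
\hat{Y} = \arg\max_{Y} P(Y \mid X) = \arg\max_{Y} \underbrace{P(X \mid Y)}_{\text{acoustic term}}\; \underbrace{P(Y)}_{\text{language-model term}}

% Autoregressive generation with an audio-conditioned language model and text prompt c
P(Y \mid X, c) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, X, c)
```

On the reported numbers: a relative accuracy gain is (new accuracy minus baseline accuracy) divided by baseline accuracy. With purely hypothetical figures, moving from 50.0% to 58.5% accuracy is an 8.5-point absolute gain and a 17% relative gain.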

Paper Structure (16 sections, 8 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: SELM: Speech Emotion Language Model. The model is fed audio and an input text prompt, which are encoded independently: the audio by the audio projection and audio mapper, the text by the text embedder. The encoded audio and text are then used to prompt a Language Model. In the figure, the input text prompt is "this person is feeling" and SELM outputs "emotion of happy". The audio projection and audio mapper are learned during training, while the Language Model and Wav2Vec2 are frozen.
  • Figure 2: Ablation study on which parameters to update for Few-Shot Learning. The first and second graphs show the performance improvement achieved by fine-tuning different parts of the model under the 4-shot and 8-shot settings for the three datasets.
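The prompting flow described in the Figure 1 caption can be sketched as a toy example that tracks only shapes: audio features pass through a projection and a mapper to become prefix vectors, the text prompt is embedded, and the two are concatenated into one input sequence for the language model. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy sizes, the random weights standing in for Wav2Vec2 features and GPT-2 embeddings, the mean pooling, and the one-linear-map-per-prefix-slot mapper.

```python
import random

random.seed(0)

D_AUDIO, D_LM, PREFIX_LEN = 6, 4, 2  # toy sizes, chosen only for illustration


def rand_matrix(rows, cols):
    """Random rows x cols matrix standing in for real weights or features."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a rows x cols matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def mean_pool(frames):
    """Average a list of frame vectors into a single vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]


# Stand-in for the frozen audio encoder output: 5 frames of D_AUDIO features.
audio_frames = rand_matrix(5, D_AUDIO)

# Parts that would be learned during training in this sketch.
projection = rand_matrix(D_LM, D_AUDIO)                        # D_AUDIO -> D_LM
mapper = [rand_matrix(D_LM, D_LM) for _ in range(PREFIX_LEN)]  # one map per prefix slot

# Stand-in for the frozen language model's text embedder (hypothetical vocabulary).
vocab = {"this": 0, "person": 1, "is": 2, "feeling": 3}
embedding_table = rand_matrix(len(vocab), D_LM)


def build_prompt(frames, text):
    """Return the audio prefix vectors followed by the text token embeddings."""
    projected = matvec(projection, mean_pool(frames))
    audio_prefix = [matvec(m, projected) for m in mapper]
    text_embeds = [embedding_table[vocab[tok]] for tok in text.split()]
    return audio_prefix + text_embeds


prompt = build_prompt(audio_frames, "this person is feeling")
print(len(prompt), len(prompt[0]))  # prints "6 4": 2 prefix slots + 4 text tokens, each of size D_LM
```

In a real system the concatenated sequence would be passed to the language model as input embeddings, and only the projection and mapper weights would receive gradient updates while the encoder and language model stay fixed.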