Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

Pedro Corrêa; João Lima; Victor Moreno; Lucas Ueda; Paula Dornhofer Paro Costa

TL;DR

The paper investigates whether spoken language models (SLMs) truly integrate acoustic prosody with semantic content by evaluating emotion recognition on emotionally incongruent speech. Using an LLM-generated set of emotion-rich sentences and three expressive TTS systems, the authors construct EMIS, a dataset that contrasts congruent and incongruent speech, and test four SLMs against a baseline speech emotion recognition (SER) system and human judgments. They find that SLMs rely largely on semantic content: accuracy on the target emotion is near random, while predictions are dominated by the semantics of the utterance, especially when the content is explicit. This points to a limited capacity to disentangle the two modalities. The work provides EMIS and code as a resource for benchmarking paralinguistic understanding in audio-language models, and it exposes a critical gap in how current SLMs process nuanced emotion and prosody.

Abstract

Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition in which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, suggesting that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
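
The incongruent condition described above can be illustrated with a minimal sketch of how one might score a model's predictions against both the emotion carried by the speech and the emotion carried by the text. This is not the authors' released code: the `Sample` fields, the `match_rates` helper, the emotion labels, and the toy predictions are all hypothetical, and the sketch assumes the target emotion is the one expressed in the speech.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # All field names are hypothetical, chosen for illustration only.
    text_emotion: str    # emotion conveyed by the words of the sentence
    speech_emotion: str  # emotion conveyed by the synthesized voice (assumed target)
    predicted: str       # label returned by the model under test


def match_rates(samples):
    """Return (speech_rate, text_rate) computed over incongruent samples only.

    speech_rate: fraction of predictions equal to the speech emotion.
    text_rate:   fraction of predictions equal to the text emotion.
    """
    incongruent = [s for s in samples if s.text_emotion != s.speech_emotion]
    if not incongruent:
        raise ValueError("no incongruent samples to score")
    n = len(incongruent)
    speech_rate = sum(s.predicted == s.speech_emotion for s in incongruent) / n
    text_rate = sum(s.predicted == s.text_emotion for s in incongruent) / n
    return speech_rate, text_rate


# Invented toy predictions, not results from the paper.
toy = [
    Sample("happy", "sad", "happy"),    # follows the text
    Sample("angry", "happy", "angry"),  # follows the text
    Sample("sad", "angry", "angry"),    # follows the speech
    Sample("happy", "happy", "happy"),  # congruent, excluded from scoring
]
speech_rate, text_rate = match_rates(toy)
print(f"speech-match rate: {speech_rate:.2f}, text-match rate: {text_rate:.2f}")
```

A model that attends to prosody would show a high speech-match rate on incongruent samples; a model that leans on the transcript would show a high text-match rate instead.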
