How Attention Shapes Emotion: A Comparative Study of Attention Mechanisms for Speech Emotion Recognition

Marc Casals-Salvador; Federico Costa; Rodolfo Zevallos; Javier Hernando

Abstract

Speech Emotion Recognition (SER) plays a key role in advancing human-computer interaction. Attention mechanisms have become the dominant approach for modeling emotional speech due to their ability to capture long-range dependencies and emphasize salient information. However, standard self-attention suffers from quadratic computational and memory complexity, limiting its scalability. In this work, we present a systematic benchmark of optimized attention mechanisms for SER, including RetNet, LightNet, GSA, FoX, and KDA. Experiments on both MSP-Podcast benchmark versions show that while standard self-attention achieves the strongest recognition performance across test sets, efficient attention variants dramatically improve scalability, reducing inference latency and memory usage by up to an order of magnitude. These results highlight a critical trade-off between accuracy and efficiency, providing practical insights for designing scalable SER systems.

Paper Structure (15 sections, 2 equations, 2 figures, 1 table)

Introduction
Related Work
Methodology
Overview
Proposed Architecture
Sequence-to-sequence alternatives
Experimental Setup
Dataset
Feature Extraction
Training Details
Evaluation Details
Experimental Results
Discussion
Acknowledgments
Generative AI Use Disclosure

Figures (2)

Figure 1: System architecture. Experiments are run with different attention mechanisms in the seq2seq module.
Figure 2: Inference time and peak GPU memory usage of the seq2seq module as a function of sequence length on the MSP-Podcast dev set 8003425. Panels (a–b) report results for all models. Panels (c–d) provide a zoomed view excluding SA to make the relative growth trends of the remaining alternatives easier to distinguish.
