Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention

Cong Wang, Yizhong Geng, Yuhua Wen, Qifei Li, Yingming Gao, Ruimin Wang, Chunfeng Wang, Hao Li, Ya Li, Wei Chen

TL;DR

Speech emotion recognition (SER) is hampered by emotional ambiguity and limited labeled data. The authors propose a three-component framework that combines energy-adaptive mixup (EAM), a frame-level attention module (FLAM), and multi-loss learning (MLL) with context broadcasting, aiming to increase data diversity, improve temporal cue extraction, and sharpen feature discrimination. They report state-of-the-art results on four benchmark datasets (IEMOCAP, MSP-IMPROV, RAVDESS, SAVEE), which indicates that the approach generalizes across both spontaneous and acted speech. The work offers an energy-aware mechanism for SER that may apply to real-world human-computer interaction (HCI) scenarios, and it leaves room for cross-linguistic and multi-modal extensions.

Abstract

Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive loss to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.
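The abstract describes EAM only as an SNR-based augmentation, without giving its formula here. As a rough illustration of the general idea, the sketch below mixes two waveforms at a chosen signal-to-noise ratio and derives a soft label from the resulting energy shares. The function names (`mix_at_snr`, `mix_labels`), the energy-share weighting, and the four-class one-hot labels are assumptions made for this example and are not taken from the paper.

```python
import math


def energy(signal):
    """Sum of squared samples of a waveform given as a list of floats."""
    return sum(v * v for v in signal)


def mix_at_snr(primary, secondary, snr_db):
    """Hypothetical helper: add `secondary` to `primary` at a target SNR.

    `secondary` is rescaled so that
    10 * log10(energy(primary) / energy(scaled secondary)) == snr_db.
    Returns the mixed waveform and `lam`, the energy share of `primary`.
    """
    e_primary = energy(primary)
    e_secondary = energy(secondary)
    if e_primary == 0 or e_secondary == 0:
        raise ValueError("both signals must have non-zero energy")

    # Energy the secondary signal must have to reach the target SNR.
    target_e_secondary = e_primary / (10 ** (snr_db / 10))
    gain = math.sqrt(target_e_secondary / e_secondary)

    mixed = [p + gain * s for p, s in zip(primary, secondary)]

    # Assumption: the label weight follows the energy share of each source.
    lam = e_primary / (e_primary + target_e_secondary)
    return mixed, lam


def mix_labels(label_primary, label_secondary, lam):
    """Interpolate two one-hot (or soft) label vectors with weight `lam`."""
    return [lam * a + (1 - lam) * b
            for a, b in zip(label_primary, label_secondary)]


# Toy usage: two short "utterances" with different emotion labels.
angry = [1.0, -1.0, 1.0, -1.0]
happy = [0.5, 0.5, -0.5, -0.5]
mixed, lam = mix_at_snr(angry, happy, snr_db=0.0)
soft_label = mix_labels([1, 0, 0, 0], [0, 1, 0, 0], lam)
```

At 0 dB both sources carry equal energy, so the soft label is split evenly between the two classes; at higher SNR values the label moves toward the primary utterance. Soft labels like these cannot be scored with a plain hard-label cross-entropy, which is one reason a divergence-based term can be useful in such setups, though how the paper wires its KL term to EAM is not stated in this excerpt.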

Paper Structure

This paper contains 12 sections, 12 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall model architecture of our proposed SER method.
  • Figure 2: t-SNE visualizations of feature distributions on IEMOCAP. (a) Training set before MLL; (b) Test set before MLL; (c) Training set after MLL; (d) Test set after MLL. Colors: Blue–Angry, Orange–Happy, Green–Sad, Red–Neutral. The feature clusters are visibly more distinct after our MLL strategy.