EmoSphere-SER: Enhancing Speech Emotion Recognition Through Spherical Representation with Auxiliary Classification

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

TL;DR

This work tackles speech emotion recognition by reframing the continuous VAD space as a sphere and introducing an auxiliary spherical-region classification task to guide VAD regression. The EmoSphere-SER framework combines a pre-trained SSL encoder, a style pooling layer, and dual prediction heads, trained with a decaying auxiliary loss that aligns region predictions with continuous VAD values. Empirical results on MSP-Podcast show consistent gains over baselines and well-behaved ablations, with the spherical-region auxiliary loss and dynamic weighting delivering notable improvements in VAD accuracy and stability. The approach demonstrates the value of structured, geometry-inspired representations for robust affective prediction and suggests avenues for extending spherical partitioning to other affective computing tasks.

Abstract

Speech emotion recognition predicts a speaker's emotional state from speech signals using discrete labels or continuous dimensions such as valence, arousal, and dominance (VAD). We propose EmoSphere-SER, a joint model that integrates spherical VAD region classification to guide VAD regression for improved emotion prediction. In our framework, VAD values are transformed into spherical coordinates that are divided into multiple spherical regions, and an auxiliary classification task predicts which spherical region each point belongs to, guiding the regression process. Additionally, we incorporate a dynamic weighting scheme and a style pooling layer with multi-head self-attention to capture spectral and temporal dynamics, further boosting performance. This combined training strategy reinforces structured learning and improves prediction consistency. Experimental results show that our approach exceeds baseline methods, confirming the validity of the proposed framework.
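
The pipeline described in the abstract (VAD point to spherical coordinates, spherical coordinates to a region label, plus an auxiliary loss whose weight changes during training) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the summary does not specify the centering, the partitioning scheme, or the weighting schedule, so the origin-centered conversion, the uniform angular grid (`n_polar`, `n_azimuth`), and the linear decay in `aux_weight` are all assumptions, and every function name is hypothetical.

```python
import math


def vad_to_spherical(v, a, d, center=(0.0, 0.0, 0.0)):
    """Convert a VAD point to (r, theta, phi) around an ASSUMED neutral center."""
    x, y, z = v - center[0], a - center[1], d - center[2]
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0  # polar angle in [0, pi]
    phi = math.atan2(y, x) % (2 * math.pi)      # azimuth in [0, 2*pi)
    return r, theta, phi


def spherical_region(theta, phi, n_polar=2, n_azimuth=4):
    """Map angles to a region index on an ASSUMED uniform angular grid."""
    polar_bin = min(int(theta / math.pi * n_polar), n_polar - 1)
    azimuth_bin = min(int(phi / (2 * math.pi) * n_azimuth), n_azimuth - 1)
    return polar_bin * n_azimuth + azimuth_bin


def aux_weight(step, total_steps, w0=1.0):
    """ASSUMED schedule: auxiliary weight decays linearly from w0 to 0."""
    return w0 * max(0.0, 1.0 - step / total_steps)


def joint_loss(reg_loss, cls_loss, step, total_steps):
    """Regression loss plus the decaying auxiliary classification loss."""
    return reg_loss + aux_weight(step, total_steps) * cls_loss


if __name__ == "__main__":
    r, theta, phi = vad_to_spherical(-1.0, -1.0, -1.0)
    print("region:", spherical_region(theta, phi))
    print("loss at mid-training:", joint_loss(1.0, 2.0, 50, 100))
```

With two polar bins and four azimuth bins the grid yields eight regions, which amounts to one region per octant of the centered VAD space. Under the assumed schedule, the region classifier shapes the representation early in training, while the regression objective dominates as the weight approaches zero.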

Paper Structure

This paper contains 21 sections, 4 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Overall framework of EmoSphere-SER