EmoHRNet: High-Resolution Neural Network Based Speech Emotion Recognition

Akshay Muppidi, Martin Radfar

TL;DR

This work tackles SER by reframing it with a high-resolution network approach. EmoHRNet adapts HRNet to process Mel-spectrograms, preserving high-resolution representations across multiple scales via parallel stages and a Fuse Layer, and uses SpecAugment-enhanced data augmentation. The model is trained with cross-entropy loss and the Adam optimizer, achieving state-of-the-art accuracies of $92.45\%$ on RAVDESS, $80.06\%$ on IEMOCAP, and $92.77\%$ on EMOVO, outperforming prior methods. The results suggest that maintaining high-resolution, multi-scale features is particularly beneficial for capturing nuanced emotional cues in speech, with implications for real-time SER applications.
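The SpecAugment step mentioned above can be illustrated with a minimal sketch of its masking operations on a Mel-spectrogram. This is not the authors' implementation: the function name `spec_augment`, the mask widths, and the fill value are assumptions chosen for illustration, and SpecAugment's time-warping component is omitted.

```python
import random


def spec_augment(spec, max_freq_width=8, max_time_width=10, fill=0.0, rng=None):
    """Return a masked copy of `spec`, a list of rows (mel bins x time frames).

    One band of consecutive mel bins and one band of consecutive time frames
    are overwritten with `fill`. Widths and fill value are illustrative
    assumptions, not values taken from the paper.
    """
    rng = rng or random.Random()
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency mask: blank f consecutive mel bins starting at f0.
    f = rng.randint(0, min(max_freq_width, n_mels))
    f0 = rng.randint(0, n_mels - f)
    for m in range(f0, f0 + f):
        out[m] = [fill] * n_frames

    # Time mask: blank t consecutive frames starting at t0 in every row.
    t = rng.randint(0, min(max_time_width, n_frames))
    t0 = rng.randint(0, n_frames - t)
    for row in out:
        row[t0:t0 + t] = [fill] * t

    return out
```

Calling `spec_augment(spec, rng=random.Random(0))` on a 16 x 20 spectrogram returns a new 16 x 20 list with up to 8 mel bins and up to 10 frames blanked. Subtracting it from the input cell by cell gives a difference map of the kind the paper's Figure 1 describes.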

Abstract

Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the initial to the final layers. By transforming audio samples into spectrograms, EmoHRNet leverages the HRNet architecture to extract high-level features. EmoHRNet's unique architecture maintains high-resolution representations throughout, capturing both granular and overarching emotional cues from speech signals. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO. Thus, we show that EmoHRNet sets a new benchmark in the SER domain.

Paper Structure

This paper contains 14 sections, 3 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: The Original Mel-Spectrogram, The Distorted SpecAugment Mel-Spectrogram, and The Difference Between Original and Augmented Mel-Spectrograms.
  • Figure 2: EmoHRNet Model Architecture: Input, High Resolution Stages, Fuse Layer, and Fully Connected Layers.