
MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative Attention

Xinxin Jiao, Liejun Wang, Yinfeng Yu

TL;DR

The paper addresses speech emotion recognition (SER), where the difficulty lies in making efficient use of the emotional cues in speech. It introduces MFHCA, which integrates a Multi-Spatial Fusion (MF) module with Global Receptive Field blocks and a Hierarchical Cooperative Attention (HCA) mechanism, fusing log-Mel spectrogram features with HuBERT-derived features. On IEMOCAP the method reports state-of-the-art results, improving weighted accuracy (WA) by $2.6\%$ and unweighted accuracy (UA) by $1.87\%$, while keeping the parameter count small. The results suggest that combining multi-scale, spatially aware spectrogram features with self-supervised HuBERT representations through hierarchical co-attention gives robust, efficient SER suited to real-world HCI deployments.

Abstract

Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for Speech Emotion Recognition using Multi-Spatial Fusion and Hierarchical Cooperative Attention on spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate HuBERT features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve 2.6% and 1.87% improvements in weighted accuracy and unweighted accuracy, respectively. Extensive experiments demonstrate the effectiveness of the proposed method.
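
The abstract reports weighted and unweighted accuracy without defining them. The sketch below assumes the definitions commonly used in SER evaluation: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls, so every emotion class counts equally regardless of its size. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by total utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall, so each emotion class has equal weight."""
    total = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)   # correct predictions per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


# Hypothetical, deliberately imbalanced toy labels (not data from the paper).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 7 correct out of 10
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0, 0.5, 0.0
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

On imbalanced data the two metrics diverge: a model that favours the majority class can score well on WA while UA exposes poor recall on the smaller classes, which is why SER results are usually reported with both.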

Paper Structure

This paper contains 11 sections, 11 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overall architecture of our proposed method.
  • Figure 2: The Global Receptive Field block.
  • Figure 3: The t-SNE visualization of the feature distribution. Panels (a) and (b) show the final combined features without and with the proposed HCA, respectively.