MFHCA: Enhancing Speech Emotion Recognition Via Multi-Spatial Fusion and Hierarchical Cooperative Attention
Xinxin Jiao, Liejun Wang, Yinfeng Yu
TL;DR
The paper tackles speech emotion recognition (SER), focusing on how to use the emotional cues in speech efficiently. It introduces MFHCA, which combines a Multi-Spatial Fusion (MF) module built on Global Receptive Field blocks with a Hierarchical Cooperative Attention (HCA) mechanism that fuses log-Mel spectrogram features with HuBERT-derived features. The approach achieves state-of-the-art results on IEMOCAP, with improvements of $2.6\%$ in weighted accuracy (WA) and $1.87\%$ in unweighted accuracy (UA), while maintaining a lightweight parameter footprint. The work indicates that combining multi-scale, spatially aware spectrogram features with self-supervised HuBERT representations through hierarchical co-attention yields robust, efficient SER suitable for real-world human-computer interaction (HCI) deployments.
Abstract
Speech emotion recognition is crucial in human-computer interaction, but extracting and using emotional cues from audio poses challenges. This paper introduces MFHCA, a novel method for speech emotion recognition that applies Multi-Spatial Fusion and Hierarchical Cooperative Attention to spectrograms and raw audio. We employ the Multi-Spatial Fusion module (MF) to efficiently identify emotion-related spectrogram regions and integrate HuBERT features for higher-level acoustic information. Our approach also includes a Hierarchical Cooperative Attention module (HCA) to merge features from various auditory levels. We evaluate our method on the IEMOCAP dataset and achieve improvements of 2.6% in weighted accuracy and 1.87% in unweighted accuracy. Extensive experiments demonstrate the effectiveness of the proposed method.
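
The two reported metrics can be illustrated with a minimal sketch. It assumes the definitions commonly used in SER work: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls. The function names and the toy labels below are hypothetical, chosen only for illustration; they are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    recalls = [hit[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (illustrative labels only).
y_true = ["angry", "angry", "angry", "angry", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "angry"]

wa = weighted_accuracy(y_true, y_pred)    # 4 of 5 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # recall: angry 4/4, sad 0/1
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

On this imbalanced toy set, WA rewards getting the majority class right while UA penalizes the missed minority class, which is why SER results are usually reported with both.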
