Bimodal Connection Attention Fusion for Speech Emotion Recognition
Jiachen Luo, Huy Phan, Lin Wang, Joshua D. Reiss
TL;DR
This work tackles bimodal speech emotion recognition by learning modality connections and intra-/inter-modal interactions between audio and text. It introduces Bimodal Connection Attention Fusion (BCAF), consisting of the Interactive Connection Network, the Bimodal Attention Network, and the Correlative Attention Network, plus uni-modal encoders (wav2vec for audio and RoBERTa for text). The model optimizes a composite loss that combines modality-specific and cross-modal objectives and achieves state-of-the-art performance on MELD and IEMOCAP. The results demonstrate improved robustness to cross-modal noise and richer cross-modal representations, enabling more accurate emotion recognition in conversations.
Abstract
Multi-modal emotion recognition is challenging due to the difficulty of extracting features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose the Bimodal Connection Attention Fusion (BCAF) method, which includes three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
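To make the notion of inter-modal interaction concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between audio and text feature sequences. It is an illustration only, not the BCAF implementation: the abstract does not specify the internals of the three modules, so the function names (`softmax`, `cross_attention`), the toy feature values, and the absence of learned projection matrices are all assumptions made for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows.

    Assumption: no learned projections are applied; a trained model would
    normally project queries, keys, and values first.
    """
    d = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows, one output feature at a time.
        attended.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return attended, weights

# Hypothetical toy features: 3 audio frames and 2 text tokens, each of dimension 4.
audio = [[0.1, 0.3, -0.2, 0.5], [0.0, -0.1, 0.4, 0.2], [0.6, 0.1, 0.0, -0.3]]
text = [[0.2, 0.1, 0.0, 0.4], [-0.3, 0.5, 0.1, 0.0]]

# Audio frames query the text tokens, and text tokens query the audio frames.
audio_from_text, a2t_weights = cross_attention(audio, text, text)
text_from_audio, t2a_weights = cross_attention(text, audio, audio)
```

In this sketch each audio frame receives a text-informed representation and each text token receives an audio-informed one; each row of the attention weights sums to one over the other modality's sequence. Intra-modal interaction would correspond to passing the same sequence as queries, keys, and values.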
