Bimodal Connection Attention Fusion for Speech Emotion Recognition

Jiachen Luo, Huy Phan, Lin Wang, Joshua D. Reiss

TL;DR

This work tackles bimodal speech emotion recognition by learning modality connections and intra-/inter-modal interactions between audio and text. It introduces Bimodal Connection Attention Fusion (BCAF), consisting of the Interactive Connection Network, the Bimodal Attention Network, and the Correlative Attention Network, plus uni-modal encoders (wav2vec for audio and RoBERTa for text). The model optimizes a composite loss that combines modality-specific and cross-modal objectives and achieves state-of-the-art performance on MELD and IEMOCAP. The results demonstrate improved robustness to cross-modal noise and richer cross-modal representations, enabling more accurate emotion recognition in conversations.

Abstract

Multi-modal emotion recognition is challenging due to the difficulty of extracting features that capture subtle emotional differences. Understanding multi-modal interactions and connections is key to building effective bimodal speech emotion recognition systems. In this work, we propose the Bimodal Connection Attention Fusion (BCAF) method, which includes three main modules: the interactive connection network, the bimodal attention network, and the correlative attention network. The interactive connection network uses an encoder-decoder architecture to model modality connections between audio and text while leveraging modality-specific features. The bimodal attention network enhances semantic complementation and exploits intra- and inter-modal interactions. The correlative attention network reduces cross-modal noise and captures correlations between audio and text. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed BCAF method outperforms existing state-of-the-art baselines.
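
To make the idea of inter-modal interaction concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between an audio sequence and a text sequence. It is not the BCAF implementation: the abstract does not specify the layers, so the function name `cross_modal_attention`, the sequence lengths, the feature dimension, the absence of learned projections, and the mean-pooling fusion step are all assumptions chosen for illustration. Only the symbols $H_a$ (audio) and $H_l$ (text) follow the paper's notation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(queries, context):
    """Generic scaled dot-product attention (hypothetical helper, not from the paper).

    queries: (T_q, d) features of the modality that asks.
    context: (T_c, d) features of the modality that is attended to.
    Returns the attended features (T_q, d) and the weights (T_q, T_c).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


rng = np.random.default_rng(0)
H_a = rng.standard_normal((50, 16))  # assumed: 50 audio frames, 16-dim features
H_l = rng.standard_normal((12, 16))  # assumed: 12 text tokens, 16-dim features

# Inter-modal interaction in both directions.
audio_attends_text, w_al = cross_modal_attention(H_a, H_l)
text_attends_audio, w_la = cross_modal_attention(H_l, H_a)

# Assumed fusion step: mean-pool each stream over time, then concatenate.
fused = np.concatenate(
    [audio_attends_text.mean(axis=0), text_attends_audio.mean(axis=0)]
)
```

Each row of `w_al` is a distribution over text tokens for one audio frame, so rows sum to one. A real system would add learned query, key, and value projections and feed `fused` to a classifier.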

Paper Structure

This paper contains 26 sections, 18 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Architecture of the proposed Bimodal Connection Attention Fusion (BCAF) method. The method consists of three modules: the unimodal representation module, the connection attention fusion module, and the classification module. The core connection attention fusion module includes the interactive connection network, the bimodal attention network, and the correlative attention network, with details depicted in Figs. 2, 3, 4 and 5. The unimodal audio representation $H_a$ and text representation $H_l$ are input into the interactive connection network, the correlative attention network, and the bimodal attention network.
  • Figure 2: Architecture of the proposed Bimodal Connection Attention Fusion method. The method consists of three modules: the uni-modal representation module, the connection attention fusion module, and the classification module. The uni-modal audio representation $H_a$ and text representation $H_l$ are input into the interactive connection network, the correlative attention network, and the bimodal attention network. The core connection attention fusion module includes the interactive connection network, the bimodal attention network, and the correlative attention network, with details depicted in Figs. 3-5, respectively.
  • Figure 3: Update scheme of the interactive connection network.
  • Figure 4: Update scheme of the bimodal attention network.
  • Figure 5: Update scheme of the correlative attention network consisting of the joint attention network and the bimodal correlation network.
  • ...and 4 more figures