WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition

Feng Li, Jiusong Luo, Wanjun Xia

TL;DR

The paper tackles speech emotion recognition (SER) by combining audio, text, and visual cues to overcome the limitations of unimodal SER. It proposes WavFusion, which uses wav2vec 2.0 as the audio backbone and a gated cross-modal attention mechanism to fuse modalities, complemented by an A-GRU-LVC visual encoder. Discriminative cross-modal representations are enforced via multimodal homogeneous feature discrepancy learning with a margin loss $L_{mar}$, giving a total loss $L_{total}^{emotion}=L_{task}^{emotion}+\lambda L_{mar}$. Experiments on IEMOCAP and MELD show state-of-the-art results, underscoring the importance of capturing nuanced cross-modal interactions for robust SER.
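
To make the structure of the total loss concrete, here is a minimal sketch of how a task loss and a margin term might be combined as $L_{task}^{emotion}+\lambda L_{mar}$. The summary does not give the exact form of $L_{mar}$, so the hinge on cosine similarities below is an assumption, as are the function names, the default `margin`, and the default `lam`; softmax cross-entropy is used as a stand-in for the task loss.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (stand-in for the task loss)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style margin term (assumed form, not taken from the paper).

    Penalizes the anchor unless it is at least `margin` more similar to
    `positive` than to `negative`.
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def total_loss(logits, label, anchor, positive, negative, lam=0.1, margin=0.2):
    """Task loss plus lam times the margin term, mirroring L_task + lambda * L_mar."""
    task = cross_entropy(logits, label)
    return task + lam * margin_loss(anchor, positive, negative, margin)
```

In this sketch, `anchor` and `positive` could be features of the same emotion taken from different modalities, and `negative` a feature of a different emotion; that pairing is one plausible reading of "homogeneous feature discrepancy learning", not a confirmed detail.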

Abstract

Speech emotion recognition (SER) remains a challenging yet crucial task due to the inherent complexity and diversity of human emotions. To address this problem, researchers attempt to fuse information from other modalities via multimodal learning. However, existing multimodal fusion techniques often overlook the intricacies of cross-modal interactions, resulting in suboptimal feature representations. In this paper, we propose WavFusion, a multimodal speech emotion recognition framework that addresses critical research problems in effective multimodal fusion, heterogeneity among modalities, and discriminative representation learning. By leveraging a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning, WavFusion demonstrates improved performance over existing state-of-the-art methods on benchmark datasets. Our work highlights the importance of capturing nuanced cross-modal interactions and learning discriminative representations for accurate multimodal SER. Experimental results on two benchmark datasets (IEMOCAP and MELD) demonstrate that WavFusion outperforms state-of-the-art methods on emotion recognition.
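
As a rough illustration of what a gated cross-modal attention step can look like, the sketch below lets audio frames attend over text tokens and then uses a sigmoid gate to decide how much of the attended context is added back to each audio frame. This is a generic formulation under stated assumptions, not the WavFusion architecture: the gate form, the residual addition, the absence of learned query/key/value projections, and all names (`cross_modal_attention`, `gated_fusion`, `gate_w`, `gate_b`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: each query frame attends over another modality."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return attended


def gated_fusion(audio, attended, gate_w, gate_b):
    """Per-dimension sigmoid gate on the attended context (assumed form).

    gate_w has one row per output dimension, each of length 2 * dim, applied
    to the concatenation [audio_frame; attended_frame].
    """
    fused = []
    for a, c in zip(audio, attended):
        concat = a + c  # list concatenation
        gate = [sigmoid(dot(row, concat) + b) for row, b in zip(gate_w, gate_b)]
        fused.append([a_d + g * c_d for a_d, g, c_d in zip(a, gate, c)])
    return fused


# Toy inputs: 2 audio frames and 3 text tokens, feature dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_w = [[0.0] * 4, [0.0] * 4]  # zero weights, so every gate value is 0.5
gate_b = [0.0, 0.0]

attended = cross_modal_attention(audio, text, text)
fused = gated_fusion(audio, attended, gate_w, gate_b)
```

The gate lets the model suppress cross-modal context when it is unhelpful for a given frame; with a gate near zero the audio representation passes through essentially unchanged.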

Paper Structure

This paper contains 15 sections, 16 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: The overview of WavFusion.