Multi-modal Attention for Speech Emotion Recognition
Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li
TL;DR
This paper tackles speech emotion recognition by leveraging visual and textual cues through a hybrid fusion approach. It introduces MMAN, which combines a multi-modal attention-based early-fusion sub-network (cLSTM-MMA) with three uni-modal sub-networks and fuses their outputs at a late stage. The directional cross-modal attention enables richer cross-modality interactions, achieving a state-of-the-art 73.94% accuracy on IEMOCAP with substantially fewer parameters than competing methods. The results demonstrate the value of incorporating three modalities and the efficiency of the proposed attention mechanism for robust emotion recognition in realistic settings.
Abstract
Emotion is an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), that makes use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses the information. cLSTM-MMA is combined with the uni-modal sub-networks through late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
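The hybrid fusion idea described above can be illustrated with a small NumPy sketch. This is not the cLSTM-MMA architecture itself: the abstract does not give its equations, so the sketch assumes plain scaled dot-product attention for the cross-modal step, mean pooling over time, random linear classifiers, and probability averaging for the late fusion. All function names, feature sizes, and the number of classes are hypothetical choices made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_seq, context_seq):
    # Assumed form: scaled dot-product attention in which one modality
    # (the query) attends to another modality (the context).
    d = query_seq.shape[-1]
    scores = query_seq @ context_seq.T / np.sqrt(d)  # (T_query, T_context)
    weights = softmax(scores, axis=-1)
    return weights @ context_seq                     # (T_query, d)

def early_fusion(speech, visual, text):
    # Each modality attends to each of the other two (six directed pairs);
    # the attended sequences are mean-pooled over time and concatenated.
    feats = {"speech": speech, "visual": visual, "text": text}
    pooled = []
    for q_name, q in feats.items():
        for c_name, c in feats.items():
            if q_name != c_name:
                pooled.append(cross_modal_attention(q, c).mean(axis=0))
    return np.concatenate(pooled)

def linear_head(x, n_classes, rng):
    # Hypothetical stand-in for a trained classifier: a random linear layer.
    w = rng.normal(size=(x.shape[0], n_classes))
    return softmax(x @ w)

def late_fusion(prob_list):
    # Assumed late-fusion rule: average the class probabilities.
    return np.mean(prob_list, axis=0)

rng = np.random.default_rng(0)
n_classes = 4                          # arbitrary choice for the sketch
speech = rng.normal(size=(20, 8))      # (time steps, feature dim)
visual = rng.normal(size=(15, 8))
text = rng.normal(size=(10, 8))

fused = early_fusion(speech, visual, text)
sub_network_probs = [
    linear_head(fused, n_classes, rng),                # early-fusion branch
    linear_head(speech.mean(axis=0), n_classes, rng),  # uni-modal branches
    linear_head(visual.mean(axis=0), n_classes, rng),
    linear_head(text.mean(axis=0), n_classes, rng),
]
final_probs = late_fusion(sub_network_probs)
```

The sketch keeps the two stages separate: the attention step mixes information across modalities before classification (early fusion), while the final averaging combines the decisions of the four branches (late fusion).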
