Multi-modal Attention for Speech Emotion Recognition

Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li

TL;DR

This paper tackles speech emotion recognition by leveraging visual and textual cues through a hybrid fusion approach. It introduces MMAN, combining a multi-modal attention-based early-fusion sub-network (cLSTM-MMA) with three uni-modal sub-networks, and fuses their outputs in a late stage. The directional, cross-modal attention enables richer cross-modality interactions, achieving state-of-the-art 73.94% accuracy on IEMOCAP with substantially fewer parameters than competing methods. The results demonstrate the value of incorporating three modalities and the efficiency of the proposed attention mechanism for robust emotion recognition in realistic settings.

Abstract

Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across three modalities and selectively fuses the information. cLSTM-MMA is fused with other uni-modal sub-networks in the late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.

Paper Structure

This paper contains 19 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The left panel shows the proposed multi-modal attention network (MMAN). It consists of a multi-modal attention sub-network (cLSTM-MMA) for early fusion and three uni-modal sub-networks: cLSTM-Text, cLSTM-Visual and cLSTM-Speech. The predictions of the four sub-networks are fused with a dense layer and a softmax layer in late fusion. The architecture of the cLSTM-MMA sub-network is shown in the red dotted box on the right panel. The symbol $\oplus$ represents concatenation, and S, V and T represent speech, visual and text respectively. The cLSTM-MMA consists of three independent dense layers that standardise the uni-modal feature embeddings, a multi-modal attention stage with three parallel directional multi-modal attention modules, and finally a cLSTM with one LSTM layer inside.
  • Figure 2: The details of the directional multi-modal attention module $S\xrightarrow{}(S,V,T)$ with the query from speech. The inputs to this module are the uni-modal feature embeddings ($\hat{s}_i, \hat{v}_i, \hat{t}_i$) after the standardization dense layers.
  • Figure 3: Normalised confusion matrix of the Speech-only baseline cLSTM-Speech and proposed cLSTM-MMA network. Diagonal entries represent the recall rates of each emotion.
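
The two fusion steps named in the captions of Figures 1 and 2 can be illustrated with a toy sketch. This is not the authors' implementation. It assumes scaled dot-product scoring for the directional attention (the captions do not state the scoring function), works on a single utterance-level embedding per modality, and uses a hypothetical four-class output with hand-picked fusion weights. The function names `directional_attention` and `late_fusion` and all numeric values are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def directional_attention(query, candidates):
    """Attend from one modality (query) over all modality embeddings.

    Assumption: scaled dot-product scoring, chosen only for illustration.
    Returns the attention-weighted sum of the candidates and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in candidates])
    attended = [
        sum(w * c[j] for w, c in zip(weights, candidates))
        for j in range(len(query))
    ]
    return attended, weights


def late_fusion(predictions, weight, bias):
    """Concatenate sub-network predictions, apply one dense layer, then softmax."""
    concat = [p for pred in predictions for p in pred]
    logits = [dot(row, concat) + b for row, b in zip(weight, bias)]
    return softmax(logits)


# Hypothetical standardised embeddings for speech, visual and text (dimension 4).
s_hat = [1.0, 0.0, 1.0, 0.0]
v_hat = [0.0, 1.0, 0.0, 1.0]
t_hat = [1.0, 1.0, 0.0, 0.0]

# S -> (S, V, T): the query comes from speech and attends over all three modalities.
attended, weights = directional_attention(s_hat, [s_hat, v_hat, t_hat])

# Hypothetical class probabilities from the four sub-networks
# (cLSTM-MMA, cLSTM-Text, cLSTM-Visual, cLSTM-Speech), four classes assumed.
preds = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.50, 0.20, 0.20, 0.10],
]
num_classes = 4
# Hand-picked dense weights that sum each class's score across sub-networks;
# in a trained model these weights would be learned.
W = [
    [1.0 if i % num_classes == k else 0.0 for i in range(4 * num_classes)]
    for k in range(num_classes)
]
b = [0.0] * num_classes
fused = late_fusion(preds, W, b)
print("attention weights over (S, V, T):", weights)
print("fused class distribution:", fused)
```

In this toy setup the speech query puts the most weight on the speech embedding itself, then on text, then on visual, simply because of how the example vectors overlap. A full model would run three such modules in parallel (one query per modality) and feed their outputs to the cLSTM, as the Figure 1 caption describes.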