HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition

Soumya Dutta, Sriram Ganapathy

TL;DR

The paper tackles emotion recognition in conversations (ERC) with a hierarchical cross-attention model (HCAM) that jointly models the audio and text modalities. Training proceeds in three stages: stage I learns uni-modal utterance embeddings, stage II adds inter-utterance context using a contextual GRU with self-attention, and stage III fuses the modalities through a co-attention mechanism that applies cross-attention in both directions. A supervised contrastive loss is combined with cross-entropy to guide learning, and test-time ensembling further boosts performance. On IEMOCAP, MELD, and CMU-MOSI, HCAM achieves state-of-the-art results. The experiments also show robustness to ASR-induced text noise and clear benefits from hierarchical context modeling and multimodal fusion, which matters for ERC in real-world dialog systems.
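As a rough illustration of how a supervised contrastive term can be combined with cross-entropy, here is a minimal pure-Python sketch. It uses the standard supervised contrastive formulation on L2-normalised embeddings. The additive weighting `ce + beta * supcon`, the function names, and the default values of `beta` and `temperature` are assumptions made for this example, not the authors' exact loss.

```python
import math


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of raw (unnormalised) logit vectors."""
    total = 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        total += log_norm - z[y]
    return total / len(labels)


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: pull same-label embeddings together.

    Anchors that have no same-label partner in the batch are skipped.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def combined_loss(logits, embeddings, labels, beta=0.5, temperature=0.1):
    """Assumed combination for illustration: cross-entropy + beta * sup-con."""
    ce = cross_entropy(logits, labels)
    supcon = supervised_contrastive(embeddings, labels, temperature)
    return ce + beta * supcon
```

In this sketch, setting `beta=0` recovers plain cross-entropy, while a lower `temperature` sharpens the penalty on same-label embeddings that sit far apart.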

Abstract

Emotion recognition in conversations is challenging due to the multi-modal nature of emotion expression. We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed using a set of bi-directional recurrent neural network layers with self-attention that convert each utterance in a given conversation to a fixed-dimensional embedding. In order to incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer that attempts to weigh the utterance-level embeddings relevant to the task of emotion recognition. The neural network parameters in the audio layers, the text layers, and the multi-modal co-attention layers are hierarchically trained for the emotion classification task. We perform experiments on three established datasets, namely IEMOCAP, MELD and CMU-MOSI, where we illustrate that the proposed model improves significantly over other benchmarks and helps achieve state-of-the-art results on all these datasets.
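To make the idea of combining audio and text utterance embeddings through attention in both directions more concrete, here is a minimal pure-Python sketch using scaled dot-product attention. It is a simplification under stated assumptions: there are no learned projections, both modalities share the same embedding size and number of utterances, and the two attended views are concatenated per utterance. The function names `cross_attention` and `co_attention` are hypothetical, and the sketch omits the self-attention sub-block of the actual co-attention network.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Each query row attends over the rows of the other modality.

    Without learned projections, the keys and values are the same vectors.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's utterance embeddings.
        attended.append(
            [sum(w * k[c] for w, k in zip(weights, keys_values)) for c in range(d)]
        )
    return attended


def co_attention(audio, text):
    """Bidirectional cross-attention over one conversation.

    `audio` and `text` are lists of utterance embeddings, one per utterance.
    Concatenating the two attended views is an assumption of this sketch.
    """
    if len(audio) != len(text):
        raise ValueError("expected one audio and one text embedding per utterance")
    audio_ctx = cross_attention(audio, text)  # audio queries, text keys/values
    text_ctx = cross_attention(text, audio)   # text queries, audio keys/values
    return [a + t for a, t in zip(audio_ctx, text_ctx)]
```

For a conversation of three utterances with 2-dimensional embeddings, `co_attention(audio, text)` returns three fused vectors of length 4, each of which could then be passed to a classifier.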

Paper Structure

This paper contains 39 sections, 6 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Block diagram of the proposed model. Here, $S_1$, $S_2$ and $S_3$ refer to the speech utterances in a conversation. Similarly, $T_1$, $T_2$ and $T_3$ refer to the text transcripts of the corresponding speech signals, and $\hat{Y_1}$, $\hat{Y_2}$ and $\hat{Y_3}$ refer to the predicted emotion labels for the three utterances. The three stages of training are also marked in the diagram.
  • Figure 2: Block diagram of the contextual GRU with self-attention. Here, $U_{T}$, $U_{T \pm 1}$ and $U_{T \pm 2}$ refer to the uni-modal embeddings from stage I of the model for each utterance in the conversation.
  • Figure 3: The co-attention network used in the proposed model. It consists of two sub-blocks: the cross-attention block and the self-attention block.
  • Figure 4: Confusion matrices for the different stages of our model on the IEMOCAP dataset with 6 classes. Abbreviations used: Hap. = Happy, Neu. = Neutral, Ang. = Angry, Exc. = Excited, Fru. = Frustrated.
  • Figure 5: Variation of the test performance with changes in $\beta$ (Eq. \ref{lossfinal}) and in the temperature parameter of the sup-con loss, for the IEMOCAP 4-way and CMU-MOSI datasets.
  • ...and 1 more figures