A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning

Wei Ai, FuChen Zhang, Tao Meng, YunTao Shou, HongEn Shao, Keqin Li

TL;DR

This work tackles multimodal emotion recognition in conversations by introducing TS-GCL, a two-stage model that leverages graph contrastive learning to align and differentiate cross-modal representations. The architecture combines context-aware modality encoding, speaker embeddings, and a graph that links utterances across text, audio, and vision, with a graph contrastive objective guiding learning and a two-stage classifier refining emotion labels. The approach yields state-of-the-art results on IEMOCAP and MELD, demonstrating improved accuracy and F1 and enhanced robustness to modality heterogeneity. Overall, TS-GCL offers a principled, scalable framework for robust MER with potential for broader conversational understanding tasks.

Abstract

In human-computer interaction, correctly understanding a user's emotional state during a conversation is becoming increasingly important, so the task of multimodal emotion recognition (MER) has started to receive more attention. However, existing emotion classification methods usually perform classification only once, and utterances are likely to be misclassified in a single round of classification. Previous work also usually ignores the similarities and differences between the features of different modalities in the fusion process. To address these issues, we propose a two-stage emotion recognition model based on graph contrastive learning (TS-GCL). First, we encode the original dataset using a different preprocessing method for each modality. Second, a graph contrastive learning (GCL) strategy is applied to the data of the three modalities, together with other structures, to learn similarities and differences within and between modalities. Finally, we apply an MLP twice to obtain the final emotion classification. This staged classification helps the model focus on different levels of emotional information, thereby improving its performance. Extensive experiments show that TS-GCL outperforms previous methods on the IEMOCAP and MELD datasets.
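The abstract does not state the contrastive objective itself, so the following is only a minimal sketch of a generic InfoNCE-style loss between two views of the same utterances, not the loss used in TS-GCL. The function names `cosine` and `info_nce`, the temperature value, and the toy text and audio embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss where view_a[i] and view_b[i] form the positive pair.

    Every other row of view_b acts as a negative for view_a[i].
    This is a generic formulation, assumed here for illustration only.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Hypothetical embeddings of two utterances in two modalities.
text_view = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]   # same utterance order as text_view
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]   # utterances mismatched

loss_aligned = info_nce(text_view, audio_aligned)
loss_swapped = info_nce(text_view, audio_swapped)
print(loss_aligned < loss_swapped)
```

In this toy setting the loss is lower when the two modalities agree on which utterance is which, which is the behaviour a cross-modal contrastive objective is meant to reward.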

Paper Structure

This paper contains 23 sections, 13 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: An example of effective multimodal multi-emotion human-machine interaction in which multimodal emotion recognition plays a key role.
  • Figure 2: The architecture of the proposed TS-GCL, which is divided into three main parts. The first part is feature extraction, in which different preprocessing methods are applied to the original dataset. The second part is graph contrastive learning, covering the graph construction process and the contrastive learning process. The last part is two-stage classification, in which an MLP performs a second round of classification during emotion classification to achieve better results.
  • Figure 3: Ablation experiments on the IEMOCAP dataset. We evaluate each component of TS-GCL and compare the results with the baseline models bc-LSTM, ICON, and DialogueGCN.
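
The Figure 2 caption says an MLP is used for a second round of classification, but it does not say how the two rounds are connected. The sketch below shows one possible wiring, assumed purely for illustration: the second MLP re-scores the emotions from the input features concatenated with the first-stage probabilities. The helper names, layer sizes, and hand-picked weights are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    """Affine map: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def mlp(x, layer1, layer2):
    """Two-layer perceptron; each layer is a (weights, bias) pair."""
    return linear(relu(linear(x, *layer1)), *layer2)

def two_stage_predict(features, stage1, stage2):
    """Stage 1 scores the emotions; stage 2 re-scores them from the
    features concatenated with the stage-1 probabilities (assumed wiring)."""
    first_probs = softmax(mlp(features, *stage1))
    second_probs = softmax(mlp(features + first_probs, *stage2))
    return first_probs, second_probs

# Hand-picked toy weights: 2 input features, 2 hidden units, 2 emotion classes.
identity2 = [[1.0, 0.0], [0.0, 1.0]]
zeros2 = [0.0, 0.0]
stage1 = ((identity2, zeros2), (identity2, zeros2))
# Stage 2 sees 4 inputs: the 2 features plus the 2 stage-1 probabilities.
stage2 = (([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], zeros2),
          (identity2, zeros2))

first, second = two_stage_predict([2.0, 1.0], stage1, stage2)
print(first, second)
```

With these toy weights the second stage sharpens the first-stage prediction for the leading class. In a trained model both stages would be learned, and the refinement need not take this form.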