A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning
Wei Ai, FuChen Zhang, Tao Meng, YunTao Shou, HongEn Shao, Keqin Li
TL;DR
This work tackles multimodal emotion recognition (MER) in conversations by introducing TS-GCL, a two-stage model that uses graph contrastive learning to align and differentiate cross-modal representations. The architecture combines context-aware modality encoding, speaker embeddings, and a graph that links utterances across text, audio, and vision; a graph contrastive objective guides learning, and a two-stage classifier refines the emotion labels. The approach yields state-of-the-art results on IEMOCAP and MELD, with improved accuracy and F1 and greater robustness to modality heterogeneity. Overall, TS-GCL offers a principled, scalable framework for robust MER, with potential for broader conversational understanding tasks.
Abstract
In human-computer interaction, correctly understanding a user's emotional state during a conversation is increasingly important, and the task of multimodal emotion recognition (MER) has accordingly received growing attention. However, existing emotion classification methods usually perform classification only once, and sentences are likely to be misclassified in a single round of classification. Moreover, previous work usually ignores the similarities and differences between the features of different modalities during fusion. To address these issues, we propose a two-stage emotion recognition model based on graph contrastive learning (TS-GCL). First, we encode each modality of the original dataset with its own preprocessing. Second, we introduce a graph contrastive learning (GCL) strategy for the data of the three modalities, together with other structures, to learn similarities and differences within and between modalities. Finally, we apply an MLP twice to obtain the final emotion classification. This staged classification helps the model focus on different levels of emotional information, thereby improving its performance. Extensive experiments show that TS-GCL outperforms previous methods on the IEMOCAP and MELD datasets.
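The abstract does not state the exact contrastive objective, so the following is only an illustrative sketch of how a contrastive loss can reward similarity between matching utterances across two modalities and penalize similarity between mismatched ones. It uses an InfoNCE-style loss, a common choice swapped in here as an assumption, not the authors' formulation. The names `cosine`, `info_nce`, `temperature`, and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE-style loss over paired embeddings (illustrative assumption).

    view_a[i] and view_b[i] embed the same utterance in two modalities and
    form the positive pair; every other view_b[j] serves as a negative.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true partner.
        total += log_denom - logits[i]
    return total / n

# Toy data: two utterances, each with a 2-d embedding per modality.
text = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]   # matches the text embeddings
audio_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # partners swapped

loss_aligned = info_nce(text, audio_aligned)
loss_shuffled = info_nce(text, audio_shuffled)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when each utterance's text embedding is closest to its own audio embedding, which is the kind of cross-modal agreement a contrastive objective is meant to encourage.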
