Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition

Yuntao Shou, Tao Meng, Wei Ai, Nan Yin, Keqin Li

TL;DR

The paper tackles multimodal emotion recognition (MER) by addressing modality heterogeneity across text, video, and audio. It introduces AR-IIGCN, which combines cross-modal adversarial representation learning with two graph-based contrastive learning streams (ICCL and IMCL) that capture intra-/inter-modal and intra-class/inter-class relationships before an MLP-based emotion classifier. The approach includes a tri-modal GAN for cross-modal fusion, a speaker-relational graph per modality, and a joint loss that blends contrastive and classification objectives. Empirical results on IEMOCAP and MELD show substantial improvements over strong baselines, indicating that the method learns clearer emotion boundaries and leverages complementary multimodal information for robust MER. The work also provides extensive ablation studies, which support the necessity of removing modality heterogeneity and the effectiveness of graph-contrastive representation learning for MER.

Abstract

With the release of a growing number of open-source emotion recognition datasets on social media platforms and the rapid development of computing resources, the multimodal emotion recognition (MER) task has begun to receive widespread research attention. The MER task extracts and fuses complementary semantic information from different modalities in order to classify the speaker's emotions. However, existing feature fusion methods usually map the features of different modalities into the same feature space for information fusion, which cannot eliminate the heterogeneity between modalities. This makes the subsequent learning of emotion class boundaries challenging. To tackle these problems, we propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method. Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for the three modal features through adversarial representation, which achieves information interaction between modalities and eliminates heterogeneity among them. Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information and to learn intra-class and inter-class boundary information of emotion categories. Specifically, we construct a graph structure for the three modal features and perform contrastive representation learning on nodes with different emotions in the same modality and nodes with the same emotion in different modalities, which improves the feature representation ability of the nodes. Extensive experiments show that the AR-IIGCN method can significantly improve emotion recognition accuracy on the IEMOCAP and MELD datasets.
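
The contrastive step described in the abstract (same emotion across modalities as positives, different emotions within a modality as negatives) can be illustrated with a small NumPy sketch. This is not the authors' objective: the InfoNCE-style form, the function name `cross_modal_contrastive_loss`, the `temperature` parameter, and the toy features are all assumptions made for illustration, and the graph construction and adversarial stages are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cross_modal_contrastive_loss(anchor, other, labels, temperature=0.5):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    anchor, other: (n, d) node features from two modalities, row i of both
                   describing the same utterance.
    labels:        (n,) integer emotion labels.
    Positives: nodes in `other` that share the anchor node's emotion.
    Negatives: nodes in the anchor's own modality with a different emotion.
    """
    a = l2_normalize(anchor)
    o = l2_normalize(other)
    same = labels[:, None] == labels[None, :]   # (n, n) same-emotion mask
    inter = np.exp(a @ o.T / temperature)       # anchor vs. other modality
    intra = np.exp(a @ a.T / temperature)       # anchor vs. own modality
    pos = (inter * same).sum(axis=1)            # same emotion, other modality
    neg = (intra * ~same).sum(axis=1)           # different emotion, same modality
    return float(np.mean(-np.log(pos / (pos + neg))))


# Toy data (hypothetical): four utterances, two emotion classes.
labels = np.array([0, 0, 1, 1])
text = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
audio_aligned = text.copy()        # same-emotion nodes agree across modalities
audio_misaligned = text[::-1]      # same-emotion nodes are orthogonal

loss_aligned = cross_modal_contrastive_loss(text, audio_aligned, labels)
loss_misaligned = cross_modal_contrastive_loss(text, audio_misaligned, labels)
print(loss_aligned, loss_misaligned)
```

In this toy setting the loss is lower when same-emotion nodes agree across modalities than when they do not, which is the behavior the abstract attributes to inter-modal contrast.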

Paper Structure (37 sections, 22 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 37 sections, 22 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustrative example of how different feature fusion methods affect sentiment classification. (a) The feature extraction process for the text, video, and audio modalities. (b) The model learns emotion class boundaries after fusing features mapped into a common feature space. (c) The model learns emotion class boundaries after fusing features mapped into separate feature spaces. With a common space, heterogeneity among modalities remains during fusion, which leads to misaligned peaks between modalities.
  • Figure 2: The overall framework of Adversarial Representation Learning with Intra-Modal and Inter-Modal Graph Contrastive Learning. It consists of a data preprocessing layer, a multimodal feature fusion layer, a graph contrastive representation learning layer, and an emotion classification layer.
  • Figure 3: The overall graph contrastive representation learning process, which comprises intra-modal, inter-modal, intra-class, and inter-class comparisons.
  • Figure 4: Experimental results on RoBERTa-Large. (a) Effect of different modal margins $\beta$ on model training results. (b) Effect of different contrastive learning methods on model training results.
  • Figure 5: Stability experiments on the IEMOCAP and MELD datasets using different batch sizes with RoBERTa-Large.
  • ...and 1 more figures