MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition

Feng Li, Ke Wu, Yongwei Li

TL;DR

This paper tackles multimodal emotion recognition by addressing three problems: modality heterogeneity, dynamic facial action unit (FAU) modeling, and class imbalance. It introduces MCN-CL, which combines a PSA-enhanced visual feature extractor with a layered, bidirectional cross-attention network and a contrastive learning objective that uses hard negative mining to align cross-modal representations. Key innovations include the PSA module for temporal FAU dynamics, a triple-query cross-attention scheme across the text, audio, and visual modalities, and a modality-specific contrastive loss. Results on IEMOCAP and MELD show Weighted F1 improvements over state-of-the-art methods, indicating practical viability for robust multimodal emotion recognition.

Abstract

Multimodal emotion recognition plays a key role in many domains, including mental health monitoring, educational interaction, and human-computer interaction. However, existing methods often face three major challenges: unbalanced category distribution, the complexity of modeling dynamic facial action units over time, and the difficulty of feature fusion due to modal heterogeneity. With the explosive growth of multimodal data in social media scenarios, the need for an efficient cross-modal fusion framework for emotion recognition is becoming increasingly urgent. To this end, this paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition. It uses a triple query mechanism and a hard negative mining strategy to remove feature redundancy while preserving important emotional cues, effectively addressing the issues of modal heterogeneity and category imbalance. Experimental results on the IEMOCAP and MELD datasets show that the proposed method outperforms state-of-the-art approaches, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.
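
Because the reported gains are in Weighted F1 and the datasets are class-imbalanced, it may help to recall how that metric is commonly computed: each class's F1 score is averaged with a weight equal to that class's share of the true labels. The sketch below is illustrative only and is not taken from the paper; the function name `weighted_f1` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        # F1 = 2*TP / (true count + predicted count); n_true >= 1 here.
        f1 = 2 * tp / (n_true + n_pred)
        # Weight each class by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Toy labels (assumed): a dominant "neu" class and two rare classes.
y_true = ["neu", "neu", "neu", "neu", "ang", "sad"]
y_pred = ["neu", "neu", "neu", "neu", "neu", "sad"]
print(round(weighted_f1(y_true, y_pred), 4))
```

In this toy case the missed "ang" sample contributes an F1 of zero but only one sixth of the weight, so the weighted score stays well above the unweighted (macro) average of the three per-class scores.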

Paper Structure

This paper contains 15 sections, 31 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: The framework of our proposed method.
  • Figure 2: Overview of the PSA-enhanced Visual Feature Extraction Module.