A Novel Approach for Multimodal Emotion Recognition: Multimodal Semantic Information Fusion

Wei Dai, Dequan Zheng, Feng Yu, Yanrong Zhang, Yaohui Hou

TL;DR

This work tackles multimodal emotion recognition by fusing text, audio, and visual information more effectively. It introduces DeepMSI-MER, a three-stage architecture that combines modality-specific feature extraction, early and late fusion, and a contrastive training objective that aligns cross-modal representations, together with a visual sequence compression (VSC) module that reduces visual redundancy. On IEMOCAP and MELD, the method is reported to achieve state-of-the-art results, with notable improvements on neutral and other challenging emotion categories, and ablation studies confirm the benefits of three-modality fusion and VSC. The approach offers practical advantages for robust emotion understanding in human-computer interaction, mental health monitoring, and intelligent services, because it strengthens cross-modal feature fusion and reduces the computational load of the visual stream.

Abstract

With the advancement of artificial intelligence and computer vision technologies, multimodal emotion recognition has become a prominent research topic. However, existing methods face challenges such as heterogeneous data fusion and the effective utilization of modality correlations. This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression. The proposed method enhances cross-modal feature fusion through contrastive learning and reduces redundancy in the visual modality by leveraging visual sequence compression. Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition, validating the effectiveness of multimodal feature fusion and the proposed approach.
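
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice, a symmetric InfoNCE-style loss between two modalities. It is not the authors' formulation. The function names, the temperature value, and the use of NumPy are assumptions made for illustration. Row i of each input is taken to be the embedding of the same utterance in a different modality, so matching rows are pulled together and non-matching rows are pushed apart.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_info_nce(emb_a, emb_b, temperature=0.1):
    """Symmetric InfoNCE-style loss between two modalities (hypothetical helper).

    emb_a, emb_b: arrays of shape (batch, dim); row i of each is assumed to
    describe the same sample. The temperature is an illustrative default.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(a.shape[0])             # positives sit on the diagonal
    loss_ab = -log_softmax(logits)[idx, idx].mean()    # a -> b direction
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)
```

In use, the loss should be small when the paired embeddings agree and larger when the pairing is arbitrary, which is the property that makes it suitable for aligning, for example, text and audio features before fusion.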

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The overall architecture of DeepMSI-MER for multimodal emotion recognition. DeepMSI-MER consists of a high-level semantic feature module, an early feature fusion module, and a late feature fusion module. The high-level semantic feature module fuses the semantic features of text and audio to further extract contextual semantic features, which are ultimately used in VSC-Swin.
  • Figure 2: Visual Sequence Compression Process.
  • Figure 3: VSC-Swin Model Improvement.
  • Figure 4: VSC-Swin Visual Sequence Compression Process.
  • Figure 5: TCN Model Architecture.
  • ...and 2 more figures