A Novel Approach for Multimodal Emotion Recognition: Multimodal Semantic Information Fusion
Wei Dai, Dequan Zheng, Feng Yu, Yanrong Zhang, Yaohui Hou
TL;DR
This work tackles multimodal emotion recognition by fusing text, audio, and visual information more effectively. It introduces DeepMSI-MER, a three-stage architecture that combines modality-specific feature extraction, early and late fusion, and a contrastive training objective to align cross-modal representations while using a visual sequence compression module to reduce visual redundancy. Across IEMOCAP and MELD, the method achieves state-of-the-art results, with notable improvements in neutral and other challenging emotion categories, and ablation studies confirm the benefits of three-modal fusion and VSC. The approach offers practical advantages for robust emotion understanding in human-computer interaction, mental health monitoring, and intelligent services by enhancing cross-modal feature fusion and reducing computational load on visual streams.
Abstract
With the advancement of artificial intelligence and computer vision technologies, multimodal emotion recognition has become a prominent research topic. However, existing methods face challenges such as heterogeneous data fusion and the effective utilization of modality correlations. This paper proposes a novel multimodal emotion recognition approach, DeepMSI-MER, based on the integration of contrastive learning and visual sequence compression. The proposed method enhances cross-modal feature fusion through contrastive learning and reduces redundancy in the visual modality by leveraging visual sequence compression. Experimental results on two public datasets, IEMOCAP and MELD, demonstrate that DeepMSI-MER significantly improves the accuracy and robustness of emotion recognition, validating the effectiveness of multimodal feature fusion and the proposed approach.
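To make the idea of contrastive cross-modal alignment concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss between two modalities. It is an illustration under stated assumptions, not the loss used in DeepMSI-MER: the abstract does not specify the exact objective, and the function names (`info_nce`, `symmetric_contrastive_loss`), the temperature value, and the toy embeddings are all hypothetical. The sketch treats embeddings at the same index (for example, the text and audio of one utterance) as a positive pair and all other pairings in the batch as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchor i to target i among all targets.

    Assumption: the pair (anchors[i], targets[i]) comes from the same sample;
    every other target in the batch acts as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, a_i in enumerate(a):
        logits = [dot(a_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidate targets.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(x, y, temperature=0.1):
    """Average the loss in both directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


# Toy embeddings (hypothetical): two utterances, two modalities.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each row close to its text partner
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]   # rows paired with the wrong utterance

loss_aligned = symmetric_contrastive_loss(text_emb, audio_aligned)
loss_swapped = symmetric_contrastive_loss(text_emb, audio_swapped)
print(loss_aligned < loss_swapped)
```

In this toy setup the loss is lower when each audio embedding lies close to the text embedding of the same utterance, which is the behavior a contrastive alignment objective rewards. A real system would compute the same quantity over learned, batched tensors in a deep learning framework.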
