Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation

Tao Meng, Fuchen Zhang, Yuntao Shou, Hongen Shao, Wei Ai, Keqin Li

TL;DR

This work targets Multimodal Emotion Recognition in Conversation (MERC) by addressing pre-fusion semantic alignment and intra-modal noise. It introduces Masked Graph Learning with Recurrent Alignment (MGLRA), which combines graph attention filtering, memory-based recursive feature alignment (MRFA), and cross-modal multi-head attention to iteratively align text, audio, and vision modalities before fusion. Fusion is performed via a lightweight masked GCN that incorporates speaker information, followed by an MLP classifier. Experiments on IEMOCAP and MELD show that MGLRA achieves state-of-the-art or on-par performance with improved efficiency, validating the effectiveness of iterative alignment and masking strategies for robust MERC in noisy, real-world settings.

Abstract

Multimodal Emotion Recognition in Conversation (MERC) has received extensive research attention in recent years because it can be applied to public opinion monitoring, intelligent dialogue robots, and other fields. Unlike traditional unimodal emotion recognition, MERC can fuse complementary semantic information from multiple modalities (e.g., text, audio, and vision) to improve emotion recognition. However, previous work fuses multimodal features directly, ignoring both the inter-modal alignment process and the intra-modal noise before fusion, which hinders representation learning. In this study, we develop a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem. It uses a recurrent iterative module with memory to align multimodal features and then uses a masked GCN for multimodal feature fusion. First, we employ an LSTM to capture contextual information and use a graph attention-filtering mechanism to eliminate noise within each modality. Second, we build a recurrent iteration module with a memory function, which uses communication between modalities to eliminate the gap between them and achieve a preliminary alignment of their features. Then, we introduce a cross-modal multi-head attention mechanism to align features across modalities and construct a masked GCN for multimodal feature fusion, which performs random mask reconstruction on the nodes of the graph to obtain better node feature representations. Finally, we use a multilayer perceptron (MLP) for emotion recognition. Extensive experiments on two benchmark datasets (IEMOCAP and MELD) demonstrate that MGLRA outperforms state-of-the-art methods.
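The masked GCN step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the zero-vector mask token, the symmetric normalization, the single ReLU layer, and the function names `normalize_adjacency`, `mask_nodes`, and `gcn_layer` are all assumptions made for illustration, and the reconstruction objective and speaker-aware edges are omitted.

```python
import math
import random


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def mask_nodes(features, mask_ratio, rng):
    """Zero out the features of a random subset of nodes (assumed mask token: zeros).

    Returns the masked copy and the sorted list of masked node indices.
    """
    n = len(features)
    k = int(round(mask_ratio * n))
    masked_ids = sorted(rng.sample(range(n), k))
    out = [list(row) for row in features]
    for i in masked_ids:
        out[i] = [0.0] * len(out[i])
    return out, masked_ids


def gcn_layer(a_norm, features, weight):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    n, d_in, d_out = len(features), len(weight), len(weight[0])
    xw = [[sum(features[i][k] * weight[k][j] for k in range(d_in)) for j in range(d_out)]
          for i in range(n)]
    return [[max(0.0, sum(a_norm[i][k] * xw[k][j] for k in range(n))) for j in range(d_out)]
            for i in range(n)]


# Toy conversation graph: three utterance nodes connected in a chain.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_features = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
identity_weight = [[1.0, 0.0], [0.0, 1.0]]  # stands in for a learned weight matrix

masked_features, masked_ids = mask_nodes(node_features, 1 / 3, random.Random(0))
hidden = gcn_layer(normalize_adjacency(adjacency), masked_features, identity_weight)
print("masked nodes:", masked_ids)
print("hidden states:", hidden)
```

In this toy setup the masked node still receives a non-zero hidden state, because the propagation step aggregates features from its unmasked neighbours. A reconstruction loss on the masked nodes would then compare such aggregated states against the original features.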

Paper Structure

This paper contains 35 sections, 23 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: An example to illustrate the importance of alignment before performing multi-modal fusion and the difference from existing methods. (a) The example demonstrates the first type of noise in multimodal emotion recognition. Red and blue represent visual information and textual information, respectively. (b) Previous alignment methods for MERC. (c) Our alignment method MGLRA.
  • Figure 2: The architecture of the proposed MGLRA. In the preprocessing stage, a separate feature extractor is used for each modality to suit the structure of its data. In the multimodal feature alignment stage, a graph filtering mechanism reduces noise, and an alignment architecture with a memory iteration mechanism enhances semantic features. Speaker information is incorporated when the graph is constructed. The masked GCN then fuses the semantics to produce the final emotion label classification.
  • Figure 3: Detailed pipeline for aligning multimodal data using MRFA and cross-modal multi-head attention. First, each modality has a corresponding memory block for information storage. Then, a single-modal attention mechanism is used to extract intra-modal information. Finally, cross-modal multi-head attention is used to achieve multimodal feature fusion. Two modalities are shown as an example; in the paper, the three modalities are crossed with each other in pairs.
  • Figure 4: Nodes on the graph are randomly masked, and a GCN aggregates information to achieve the final fusion of multimodal features and emotion classification.
  • Figure 5: Emotion label distribution on the IEMOCAP and MELD datasets. Compared with MELD's emotion label distribution, IEMOCAP has a severe data imbalance problem, indicating that its emotions are more difficult to identify in the experiments.
  • ...and 4 more figures
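
The cross-modal multi-head attention described in the Figure 3 caption can be sketched as follows. This is a simplified illustration rather than the paper's module: queries come from one modality and keys and values from another, the learned projection matrices are omitted, and the function names `cross_attention` and `multi_head_cross_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return output


def multi_head_cross_attention(x_query, x_context, num_heads):
    """Split the feature dimension into heads, attend per head, then concatenate.

    Assumption: learned projections (W_Q, W_K, W_V, W_O) are left out for brevity,
    so each head simply sees a slice of the raw features.
    """
    d_model = len(x_query[0])
    if d_model % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = d_model // num_heads
    fused = [[] for _ in x_query]
    for h in range(num_heads):
        cols = slice(h * head_dim, (h + 1) * head_dim)
        q = [row[cols] for row in x_query]
        kv = [row[cols] for row in x_context]
        for i, head_row in enumerate(cross_attention(q, kv, kv)):
            fused[i].extend(head_row)
    return fused


# Toy example: three text utterances attend over two audio frames (4-dim features).
text_feats = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
audio_feats = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0]]
text_given_audio = multi_head_cross_attention(text_feats, audio_feats, num_heads=2)
print(text_given_audio)
```

With three modalities, the same function would be applied to each ordered pair (for example text over audio, text over vision, and so on), which matches the pairwise crossing the caption describes.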