Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations

Yuntao Shou, Wei Ai, Jiayi Du, Tao Meng, Haiyan Liu, Nan Yin

TL;DR

This work addresses multi-modal emotion recognition in conversation (MERC) by introducing ELR-GNN, a graph-based model that efficiently captures long-distance latent dependencies across utterances while fusing multi-modal cues. It combines sequential context extraction via Bi-LSTM, a speaker-relationship graph, and a propagation framework based on generalized forward push (GFP) with top-$k$ sparsification to model distant contextual dependencies. An auxiliary information module (AIM) performs denoising and dual fusion (early and adaptive late) of modal and contextual signals, yielding high-level discourse features that are fed into a final MLP for emotion prediction. On IEMOCAP and MELD, the model reports state-of-the-art accuracy and F1 with running times reduced by 52% and 35%, respectively, which makes it practical for scalable MERC in conversational AI.

Abstract

The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNNs) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNNs, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first feed pre-extracted text, video, and audio features into a Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use these low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse speaker relationship information with the latent dependency information from the context. Finally, we obtain high-level discourse features and feed them into an MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively.
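
To make the precomputation idea concrete, here is a minimal sketch of the classic forward push procedure for approximate personalized PageRank, followed by top-$k$ sparsification. It is not the paper's dilated generalized variant, whose details are not given on this page. The function names `forward_push` and `top_k`, the teleport probability `alpha`, and the toy chain graph are all assumptions made for illustration; `r_max` plays the role of the residual threshold mentioned in the Figure 3 caption.

```python
from collections import deque


def forward_push(adj, source, alpha=0.15, r_max=1e-4):
    """Approximate personalized PageRank from `source` by local pushes.

    adj   : dict mapping each node to a list of its neighbours
    alpha : teleport probability (assumed value, not from the paper)
    r_max : residual threshold; a node is pushed while r[u] > r_max * deg(u)
    Returns (p, r): the estimated scores and the leftover residuals.
    """
    p = {}                      # propagation scores accumulated so far
    r = {source: 1.0}           # residual mass still waiting to be pushed
    queue = deque([source])
    queued = {source}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        deg = len(adj[u])
        res = r.get(u, 0.0)
        if deg == 0 or res <= r_max * deg:
            continue            # residual too small (or no neighbours): skip
        p[u] = p.get(u, 0.0) + alpha * res   # keep an alpha share at u
        r[u] = 0.0
        share = (1.0 - alpha) * res / deg    # spread the rest over neighbours
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + share
            if v not in queued and r[v] > r_max * len(adj[v]):
                queue.append(v)
                queued.add(v)
    return p, r


def top_k(scores, k):
    """Keep only the k largest scores (top-k sparsification)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


# Toy conversation: six utterances linked to their immediate neighbours.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
scores, residual = forward_push(chain, source=0)
neighbours = top_k(scores, k=3)   # the 3 utterances most relevant to utterance 0
```

Raising `r_max` stops the pushes earlier, which trades accuracy for speed, and a smaller `k` yields a sparser set of long-distance neighbours per utterance. Because the scores depend only on the graph, they can be computed once before training.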

Paper Structure (26 sections, 13 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: The overall architecture of ELR-GNN for multi-modal emotion recognition. ELR-GNN contains an auxiliary information module and a graph random neural network module. The auxiliary information module further extracts contextual semantic information and fuses speaker relationships with long-distance latent relationships through early and adaptive late fusion. The graph random neural network module models speaker relationships and long-distance contextual latent dependencies.
  • Figure 2: Confusion matrix of ELR-GNN and LR-GNN classification on IEMOCAP and MELD datasets.
  • Figure 3: We tested the impact of the maximum neighborhood size and parameter $r_{max}$ in ELR-GNN on the accuracy and running time of emotion recognition.