
Dynamic Fusion-Aware Graph Convolutional Neural Network for Multimodal Emotion Recognition in Conversations

Tao Meng, Weilun Tang, Yuntao Shou, Yilong Tan, Jun Zhou, Wei Ai, Keqin Li

Abstract

Multimodal emotion recognition in conversations (MERC) aims to identify and understand the emotions expressed by speakers during utterance interaction from multiple modalities (e.g., text, audio, and images). Existing studies have shown that graph convolutional networks (GCNs) can improve MERC performance by modeling dependencies between speakers. However, existing methods usually process multimodal features with fixed parameters regardless of emotion type, ignoring the dynamics of fusion between modalities. This forces the model to trade off performance across emotion categories and limits its performance on some specific emotions. To this end, we propose a dynamic fusion-aware graph convolutional neural network (DF-GCN) for robust recognition of multimodal emotion features in conversations. Specifically, DF-GCN integrates ordinary differential equations into GCNs to capture the dynamic nature of emotional dependencies within utterance interaction networks, and leverages prompts generated from the global information vector (GIV) of the utterance to guide the dynamic fusion of multimodal features. This allows our model to change its parameters dynamically when processing each utterance feature, so that different network parameters can be applied to different emotion categories at inference time, thereby achieving more flexible emotion classification and enhancing the generalization ability of the model. Comprehensive experiments on two public multimodal conversational datasets confirm that the proposed DF-GCN model delivers superior performance, benefiting significantly from the dynamic fusion mechanism.
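To make the idea of GIV-guided dynamic fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the GIV can be approximated by an element-wise mean over the modality vectors of one utterance (the paper uses a learned block for this), and that per-modality fusion weights come from a softmax over dot products with hypothetical projection vectors `prompt_proj`. All function and parameter names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_information_vector(modal_feats):
    """Stand-in GIV (assumption): element-wise mean over modality vectors."""
    dim = len(modal_feats[0])
    count = len(modal_feats)
    return [sum(f[i] for f in modal_feats) / count for i in range(dim)]


def dynamic_fusion(modal_feats, prompt_proj):
    """Fuse one utterance's modality vectors with GIV-dependent weights.

    modal_feats: one feature vector per modality (e.g. text, audio, vision).
    prompt_proj: one hypothetical projection vector per modality; in a real
                 model these would be learned parameters.
    Returns (fused_vector, modality_weights).
    """
    giv = global_information_vector(modal_feats)
    # One score per modality, conditioned on the utterance's GIV.
    scores = [sum(g * p for g, p in zip(giv, proj)) for proj in prompt_proj]
    weights = softmax(scores)
    dim = len(giv)
    fused = [sum(w * f[i] for w, f in zip(weights, modal_feats))
             for i in range(dim)]
    return fused, weights


if __name__ == "__main__":
    proj = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    # Two utterances with different content receive different fusion weights.
    print(dynamic_fusion([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], proj))
    print(dynamic_fusion([[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]], proj))
```

The point of the sketch is only that the fusion weights are a function of the input utterance rather than fixed constants, which is the contrast with static fusion drawn in the abstract.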

Paper Structure (29 sections, 25 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Illustration of the difference between traditional static fusion and dynamic fusion processes. (a) Previous studies usually use fixed network parameters to fuse multimodal features. (b) Our proposed method generates modal parameters from the GIV, thereby achieving dynamic multimodal feature fusion.
  • Figure 2: The overall architecture of the proposed DF-GCN model. The framework consists of three main components: (i) Graph Construction, where multimodal utterances from text (RoBERTa), audio (OpenSMILE), and vision (DenseNet) are encoded and integrated through an attention layer and Bi-GRU to form discourse-level interaction graphs; (ii) Global Information Block, which employs stacked Transformer layers and global average pooling to generate the GIV that guides context-aware fusion; and (iii) Prompts Generation Network, where DGCODE blocks dynamically generate weight prompts for adaptive multimodal feature integration. SGCODE captures structural dependencies, while DGCODE models temporal and contextual dynamics. Skip connections and batch normalization enhance training stability, and the final fused representations are used for emotion classification.
  • Figure 3: Verifying the effectiveness of multimodal features.
  • Figure 4: Sensitivity analysis of DF-GCN with respect to key graph construction hyperparameters. (a) Effect of the similarity threshold $\theta$ on WF1. (b) Effect of the context window size $w$ on WF1.
  • Figure 5: Confusion matrices of emotion classification results on the IEMOCAP and MELD test sets. Subfigures (a) and (b) present the performance of the proposed DF-GCN model on IEMOCAP and MELD, respectively, while subfigures (c) and (d) show the results of the baseline DER-GCN under the same settings. Darker diagonal blocks indicate higher accuracy in correctly identifying specific emotion categories, whereas off-diagonal values represent misclassifications.
  • ...and 1 more figures
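
The captions above mention a similarity threshold $\theta$, a context window size $w$, and ODE-based graph blocks. The sketch below shows one plausible reading, under stated assumptions rather than the paper's actual definitions: two utterances are linked when they lie within $w$ positions of each other and their cosine similarity is at least $\theta$, and node features are then propagated by explicit Euler steps of the simple graph ODE $dH/dt = (\hat{A} - I)H$, where $\hat{A}$ is the row-normalized adjacency. The function names, the edge rule, and the choice of ODE are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def build_graph(feats, theta, w):
    """Adjacency matrix with self-loops (assumed edge rule).

    Utterances i and j are connected when |i - j| <= w and their
    cosine similarity is at least theta.
    """
    n = len(feats)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - w), min(n, i + w + 1)):
            if i == j or cosine(feats[i], feats[j]) >= theta:
                adj[i][j] = 1.0
    return adj


def row_normalize(adj):
    """Divide each row by its sum; self-loops guarantee a nonzero sum."""
    return [[v / sum(row) for v in row] for row in adj]


def euler_graph_ode(feats, adj_norm, steps, dt):
    """Integrate dH/dt = (A_norm - I) H with explicit Euler steps."""
    h = [row[:] for row in feats]
    n, d = len(h), len(h[0])
    for _ in range(steps):
        # Neighborhood aggregation A_norm @ H, written out in plain Python.
        agg = [[sum(adj_norm[i][k] * h[k][c] for k in range(n))
                for c in range(d)] for i in range(n)]
        h = [[h[i][c] + dt * (agg[i][c] - h[i][c]) for c in range(d)]
             for i in range(n)]
    return h


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.0]]
    adj = build_graph(feats, theta=0.9, w=1)
    print(adj)
    print(euler_graph_ode(feats, row_normalize(adj), steps=1, dt=0.5))
```

Under this reading, a larger $w$ admits more distant utterance pairs as candidates and a larger $\theta$ prunes weakly related ones, which is the kind of trade-off the sensitivity analysis in Figure 4 examines.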