AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

Yunsheng Wang, Yuntao Shou, Yilong Tan, Wei Ai, Tao Meng, Keqin Li

TL;DR

The paper proposes AMB-DSGDN, an Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network whose adaptive modality balancing mechanism estimates a dropout probability for each modality according to its relative contribution to emotion modeling.

Abstract

Multimodal dialogue emotion recognition captures emotional cues by fusing text, visual, and audio modalities. However, existing approaches still suffer from notable limitations in modeling emotional dependencies and learning multimodal representations. On the one hand, they are unable to effectively filter out redundant or noisy signals within multimodal features, which hinders the accurate capture of the dynamic evolution of emotional states across and within speakers. On the other hand, during multimodal feature learning, dominant modalities tend to overwhelm the fusion process, thereby suppressing the complementary contributions of non-dominant modalities such as speech and vision, ultimately constraining the overall recognition performance. To address these challenges, we propose an Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network (AMB-DSGDN). Concretely, we first construct modality-specific subgraphs for text, speech, and vision, where each modality contains intra-speaker and inter-speaker graphs to capture both self-continuity and cross-speaker emotional dependencies. On top of these subgraphs, we introduce a differential graph attention mechanism, which computes the discrepancy between two sets of attention maps. By explicitly contrasting these attention distributions, the mechanism cancels out shared noise patterns while retaining modality-specific and context-relevant signals, thereby yielding purer and more discriminative emotional representations. In addition, we design an adaptive modality balancing mechanism, which estimates a dropout probability for each modality according to its relative contribution in emotion modeling.
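The differential graph attention idea in the abstract (contrasting two sets of attention maps so that shared noise cancels) can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the two score matrices `scores_a` and `scores_b`, the weighting factor `lam`, the boolean `adjacency` mask standing in for an intra-speaker or inter-speaker subgraph, and the helper names themselves.

```python
import math


def masked_softmax(scores, mask):
    """Row-wise softmax restricted to entries where mask is True."""
    out = []
    for row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(row, mask_row) if m]
        if not allowed:
            # Isolated node: no neighbours to attend to.
            out.append([0.0] * len(row))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row, mask_row)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def differential_attention(scores_a, scores_b, adjacency, lam=0.5):
    """Subtract a second attention map from the first (assumed form: A - lam * B).

    Patterns that both maps agree on are attenuated, while edges that only the
    first map emphasises keep most of their weight.
    """
    attn_a = masked_softmax(scores_a, adjacency)
    attn_b = masked_softmax(scores_b, adjacency)
    return [
        [a - lam * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(attn_a, attn_b)
    ]


def aggregate(attn, features):
    """Weighted sum of neighbour feature vectors for every node."""
    dim = len(features[0])
    return [
        [sum(w * f[d] for w, f in zip(row, features)) for d in range(dim)]
        for row in attn
    ]


if __name__ == "__main__":
    # Hypothetical 3-utterance subgraph; True marks an edge.
    adjacency = [[True, True, False], [True, True, True], [False, True, True]]
    scores_a = [[1.0, 2.0, 0.0], [0.5, 0.5, 0.5], [0.0, 1.0, 3.0]]
    scores_b = [[0.0, 0.0, 9.0], [1.0, 0.0, 1.0], [2.0, 2.0, 2.0]]
    diff = differential_attention(scores_a, scores_b, adjacency)
    print(aggregate(diff, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Under this assumed form, every row of the differential map sums to `1 - lam` for nodes that have at least one neighbour, and two identical attention maps cancel completely when `lam = 1`, which is the sense in which shared patterns are removed.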

Paper Structure

This paper contains 28 sections, 33 equations, 8 figures, 9 tables, and 2 algorithms.

Figures (8)

  • Figure 1: An authentic and representative segment illustrating the dynamic evolution of dialogue from the IEMOCAP dataset (Ses01F_impro01).
  • Figure 2: The architecture comprises four core modules. First, the utterance-level encoder extracts unimodal features with OpenSmile (audio), RoBERTa (text), and DenseNet (video); a Transformer then encodes them together with speaker embeddings (Spk Emb) and position embeddings (Pos Emb) to obtain the text feature $\mathbf{x}_i^t$, video feature $\mathbf{x}_i^v$, and audio feature $\mathbf{x}_i^a$ of the $i$-th utterance. Second, the differential graph attention module builds an intra-speaker subgraph (Intraspeaker) and an inter-speaker subgraph (Interspeaker) for each modality and computes the difference between two groups of attention distributions, which eliminates cross-modal shared noise and retains modality-specific emotional signals. Third, the adaptive modality balancing module computes a dropout probability for each modality ($q_t$, $q_v$, $q_a$) from batch-level performance, applies dynamic dropout to dominant modalities, and scales the retained features through gradient compensation to maintain information balance. Finally, the multimodal fusion and classification module fuses the balanced features via linear layers and feeds them to the classifier to obtain the emotion recognition result.
  • Figure 3: Experimental results for different window sizes on the IEMOCAP and MELD datasets. The purple line shows wa-ACC and the light yellow bars show wa-F1; the left subplot corresponds to IEMOCAP and the right subplot to MELD. Note: the window size determines the range over which the graph convolutional network captures semantic associations and needs to be adjusted to the characteristics of each dataset.
  • Figure 4: Sensitivity analysis of the number of attention heads on the IEMOCAP and MELD datasets. The blue line shows wa-F1 on IEMOCAP and the green line shows wa-F1 on MELD; the horizontal axis “nhead Values” gives the number of attention heads and the vertical axis “wa-F1 Score” gives the corresponding weighted average F1 score.
  • Figure 5: Performance dynamics in the warm-up phase. The horizontal axis is the warm-up period (i.e., the number of training rounds before the modality dropout mechanism is gradually activated) and the vertical axis is the model performance metric; the differently colored curves correspond to the weighted average F1 scores and accuracies on the IEMOCAP and MELD datasets, showing how the warm-up period affects model performance.
  • ...and 3 more figures
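
The adaptive modality balancing step described in the Figure 2 caption (per-modality dropout probabilities $q_t$, $q_v$, $q_a$ from batch-level performance, dropout on dominant modalities, and rescaling of the retained features) might be sketched as below. The captions do not state the actual formulas, so this is only one plausible instance: the dominance ratio, the `tanh` squashing, the cap `q_max`, and the `1 / (1 - q)` rescaling (standard inverted dropout, used here as a stand-in for the paper's gradient compensation) are all assumptions, as are the function names.

```python
import math
import random


def dropout_probabilities(scores, q_max=0.5, eps=1e-12):
    """Map batch-level modality scores to dropout probabilities.

    Assumed rule: a modality is dominant when its score exceeds the mean score
    of the other modalities; only dominant modalities get a non-zero
    probability, which is bounded above by q_max (q_max must be below 1).
    """
    probs = {}
    for name, score in scores.items():
        others = [s for other, s in scores.items() if other != name]
        ratio = score / (sum(others) / len(others) + eps)
        probs[name] = q_max * math.tanh(max(ratio - 1.0, 0.0))
    return probs


def balance_features(features, probs, rng):
    """Drop each modality with its probability; rescale the ones that survive."""
    balanced = {}
    for name, vector in features.items():
        q = probs[name]
        if rng.random() < q:
            # Modality dropped for this step.
            balanced[name] = [0.0] * len(vector)
        else:
            # Inverted-dropout scaling keeps the expected magnitude unchanged.
            balanced[name] = [x / (1.0 - q) for x in vector]
    return balanced


if __name__ == "__main__":
    # Hypothetical batch-level scores: text dominates vision and audio.
    scores = {"t": 0.9, "v": 0.3, "a": 0.3}
    q = dropout_probabilities(scores)
    features = {"t": [1.0, 2.0], "v": [1.0, 1.0], "a": [3.0, 0.0]}
    print(q)
    print(balance_features(features, q, random.Random(0)))
```

With these hypothetical scores, only the text modality receives a non-zero dropout probability, so vision and audio always pass through unchanged while text is occasionally zeroed out, which matches the caption's description of suppressing dominant modalities.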