Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

Ying Liu, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li

Abstract

In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, so the extracted features contain excessive noise. Furthermore, data quality and information-carrying capacity are imbalanced across modalities. Together, these two issues cause information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality to emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition (MCER). Specifically, we first design a differential Transformer that explicitly computes the difference between two attention maps, enhancing temporally consistent information while suppressing time-irrelevant noise and thereby denoising both the audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that uses self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, yielding more robust and semantically aligned multimodal fusion.
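To make the denoising step concrete, below is a minimal PyTorch sketch of the differential-attention idea the abstract describes: two attention maps are computed from independent query/key projections and subtracted, so spurious attention patterns shared by both maps cancel while temporally consistent structure survives. The class name, single-head layout, and learnable weight `lam` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttention(nn.Module):
    """Minimal sketch of differential attention for denoising: two softmax
    attention maps from independent projections are subtracted, so
    time-irrelevant attention shared by both maps cancels out."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q1, self.k1 = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.q2, self.k2 = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.lam = nn.Parameter(torch.tensor(0.5))  # assumed learnable weight of the second map
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) audio or video utterance features
        a1 = F.softmax(self.q1(x) @ self.k1(x).transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(self.q2(x) @ self.k2(x).transpose(-2, -1) * self.scale, dim=-1)
        return (a1 - self.lam * a2) @ self.v(x)  # differential map weights the values
```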

Paper Structure

This paper contains 36 sections, 32 equations, 13 figures, and 16 tables.

Figures (13)

  • Figure 1: Performance comparison of various models on the IEMOCAP dataset using unimodal features. T, V, and A denote the textual, visual, and acoustic modalities, respectively.
  • Figure 2: The overall framework: utterance-level unimodal features are first extracted from audio, video, and text; a differential Transformer is then applied to the audio and video features for dynamic extraction and denoising, while a relational subgraph module comprising InterGAT and IntraGAT models the emotional dependencies of the textual features; finally, emotion classification is performed by fusing the multimodal features, with text as the core, via cross-modal diffusion attention (see the fusion sketch after this list).
  • Figure 3: Effect of varying window sizes on model performance across two datasets.
  • Figure 4: The impact of different $\lambda_m$ values on model performance. Note that a horizontal-axis value of 0 means that only the fusion loss is used as the total loss.
  • Figure 5: Performance comparison of differential strategies on the IEMOCAP and MELD datasets.
  • ...and 8 more figures
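The text-guided fusion in Figure 2 can be pictured as self-attention over the text stream followed by cross-attention in which text features act as queries over audiovisual keys and values. The PyTorch sketch below is a hypothetical reading of that mechanism; the head count, normalization placement, and the assumption that all streams share a common dimension `d` are ours, not the paper's.

```python
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    """Hypothetical sketch of text-guided cross-modal diffusion attention:
    self-attention models intra-modal dependencies in the text stream, then
    cross-attention diffuses audiovisual information into it."""
    def __init__(self, d: int, num_heads: int = 4):
        super().__init__()  # d must be divisible by num_heads
        self.self_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)

    def forward(self, text: torch.Tensor, audiovisual: torch.Tensor) -> torch.Tensor:
        # text: (batch, n_utterances, d); audiovisual: (batch, n_av_tokens, d)
        h, _ = self.self_attn(text, text, text)  # intra-modal dependencies
        t = self.norm1(text + h)
        # Text queries attend over audio/video keys and values, so audiovisual
        # cues are adaptively diffused into the text-centred representation.
        h, _ = self.cross_attn(t, audiovisual, audiovisual)
        return self.norm2(t + h)
```

In this reading, the audio and video features denoised by the differential Transformer would be concatenated along the token axis and passed as `audiovisual`, keeping text as the core of the fused representation.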