Knowledge-Guided Dynamic Modality Attention Fusion Framework for Multimodal Sentiment Analysis

Xinyu Feng, Yuming Lin, Lihua He, You Li, Liang Chang, Ya Zhou

TL;DR

KuDA tackles the problem of static modality weighting in multimodal sentiment analysis by introducing a knowledge-guided dynamic attention fusion framework. It leverages sentiment knowledge injected through adapters, converts unimodal predictions into modality-specific sentiment ratios, and fuses modalities via dynamic attention blocks guided by these ratios and cross-modal interactions. A correlation-estimation loss via Noise-Contrastive Estimation encourages the multimodal representation to align with unimodal cues, and a two-stage training regime stabilizes this knowledge transfer. Empirical results on four benchmarks show state-of-the-art performance and robust adaptation to varying dominant modalities, underscoring KuDA's practical value for diverse real-world multimodal sentiment tasks.

Abstract

Multimodal Sentiment Analysis (MSA) uses multimodal data to infer users' sentiment. Previous methods either treat the contribution of each modality equally or statically use text as the dominant modality when modeling interactions, neglecting the fact that any modality may become dominant. In this paper, we propose a Knowledge-Guided Dynamic Modality Attention Fusion Framework (KuDA) for multimodal sentiment analysis. KuDA uses sentiment knowledge to guide the model to dynamically select the dominant modality and adjust the contribution of each modality. In addition, with the obtained multimodal representation, the model can further highlight the contribution of the dominant modality through the correlation evaluation loss. Extensive experiments on four MSA benchmark datasets indicate that KuDA achieves state-of-the-art performance and adapts to different dominant-modality scenarios.
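
To make the idea of a dynamically selected dominant modality concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that each modality yields a unimodal sentiment score in [-1, 1], that sentiment ratios are a softmax over the absolute scores, and that fusion is a ratio-weighted sum of feature vectors; KuDA itself fuses modalities with dynamic attention blocks. The names `sentiment_ratios` and `fuse` and all sample values are hypothetical.

```python
import math

def sentiment_ratios(unimodal_scores):
    """Map unimodal sentiment scores to ratios that sum to 1.

    Assumption for illustration: a softmax over absolute score
    magnitudes, so the modality with the strongest sentiment dominates.
    """
    strengths = [abs(s) for s in unimodal_scores.values()]
    peak = max(strengths)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in strengths]
    total = sum(exps)
    return {name: e / total for name, e in zip(unimodal_scores, exps)}

def fuse(features, ratios):
    """Ratio-weighted sum of equal-length modality feature vectors.

    A simplified stand-in for attention-based fusion.
    """
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for name, vec in features.items():
        for i, x in enumerate(vec):
            fused[i] += ratios[name] * x
    return fused

# Hypothetical sample in which vision carries the strongest sentiment.
scores = {"text": 0.1, "vision": -0.8, "audio": 0.2}
features = {"text": [1.0, 0.0], "vision": [0.0, 1.0], "audio": [0.5, 0.5]}

ratios = sentiment_ratios(scores)
dominant = max(ratios, key=ratios.get)
fused = fuse(features, ratios)
print(dominant, fused)
```

Because the ratios are recomputed per sample, a different sample with a strong text or audio score would shift the dominant modality without any change to the code, which is the behavior that a static text-dominant scheme cannot provide.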

Paper Structure

This paper contains 25 sections, 11 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Three video samples from the CH-SIMS dataset with vision, text, or audio as the dominant modality. In the Ground Truth, V, T, and A denote vision, text, and audio, respectively. M is the overall sentiment label of the video sample. All label values range from -1 (negative) to 1 (positive).
  • Figure 2: The overall architecture of KuDA. The green, purple, orange, and gray modules represent the operations for vision, text, audio, and multimodal fusion, respectively.
  • Figure 3: The architecture of BERT, the Transformer encoder, and the adapter, and the connections between them. The vertical gray rectangles denote Transformer encoder layers. Down-FC and Up-FC are the fully connected (FC) layers used to decrease and increase the dimension, respectively.
  • Figure 4: Architecture of a single dynamic attention block. Purple, green, and orange represent the interaction of the multimodal representation with the text, vision, and audio modalities, respectively.
  • Figure 5: Performance on CH-SIMSv2 and MOSI as $\alpha$ varies.
  • ...and 3 more figures