Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis

Han Wu, Yanming Sun, Yunhe Yang, Derek F. Wong

TL;DR

This paper tackles robust multimodal sentiment analysis by addressing the failure of simple fusion to handle noisy or conflicting modalities. It introduces the Adaptive Gated Fusion Network (AGFN), a dual-gated fusion mechanism that combines an Information Entropy Gate and a Modality Importance Gate, balanced by a learnable parameter, together with cross-modal attention and virtual adversarial training (VAT) for robustness. AGFN achieves state-of-the-art performance on CMU-MOSI and competitive results on CMU-MOSEI, and empirical analyses show that it learns a more robust, generalized feature space, as evidenced by a reduced Prediction Space Correlation (PSC). The approach advances practical MSA by explicitly modeling information reliability and modality importance, enabling better suppression of noise and conflicts in real-world data.

Abstract

Multimodal sentiment analysis (MSA) leverages information fusion from diverse modalities (e.g., text, audio, visual) to enhance sentiment prediction. However, simple fusion techniques often fail to account for variations in modality quality, such as modalities that are noisy, missing, or semantically conflicting. This oversight leads to suboptimal performance, especially in discerning subtle emotional nuances. To mitigate this limitation, we introduce a simple yet efficient Adaptive Gated Fusion Network (AGFN) that adaptively adjusts feature weights via a dual-gate fusion mechanism based on information entropy and modality importance. This mechanism mitigates the influence of noisy modalities and prioritizes informative cues following unimodal encoding and cross-modal interaction. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance. Visualization analysis of feature representations demonstrates that AGFN enhances generalization by learning from a broader feature distribution. It achieves this by reducing the correlation between feature location and prediction error, which decreases reliance on specific locations and yields more robust multimodal feature representations.

Paper Structure

This paper contains 21 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: An example of modal inconsistency from the CH-SIMS dataset (Yu et al., 2020). In this case, the text promotes unity, but the visuals convey negative affect, creating a conflict.
  • Figure 2: Overview of the AGFN architecture. Text, audio, and video inputs are processed by modality-specific encoders. Cross-modal attention refines these features. The enhanced features are fed into the dual-gated adaptive fusion module: (1) The information entropy gate assesses modality certainty, and (2) the modality importance gate learns sample-specific weights. A learned parameter $\alpha$ adaptively balances contributions from both gates to produce the final fused feature representation.
  • Figure 3: t-SNE visualization of final fused representations on the CMU-MOSI test set. Left: Features learned via a simple fusion (concatenation) baseline. Right: Features learned using an adaptive fusion strategy. Points are colored by sentiment polarity.
  • Figure 4: Case study on the CH-SIMS dataset. Each case displays the unimodal sentiment scores (T: Text, A: Audio, V: Visual), the ground-truth multimodal label (M), and the predictions from our full AGFN model and from the model without gating ("w/o Gating").
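
The dual-gated fusion described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the page does not give the gate equations, so the entropy definition (Shannon entropy of a softmax-normalized feature vector), the linear scoring function in the importance gate, the sigmoid parameterization of $\alpha$, and all function and parameter names below are assumptions chosen only to make the idea concrete.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(feature):
    """Shannon entropy of the softmax-normalized feature (assumed definition)."""
    probs = softmax(feature)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def weighted_sum(weights, features):
    """Combine modality features dimension by dimension using the given weights."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def entropy_gate(features):
    """Assumption: lower entropy means higher certainty, so it gets more weight."""
    return softmax([-entropy(f) for f in features])


def importance_gate(features, score_weights, score_bias):
    """Hypothetical linear scorer; in a real model these parameters are learned."""
    scores = [
        sum(w * x for w, x in zip(score_weights, f)) + score_bias
        for f in features
    ]
    return softmax(scores)


def dual_gated_fusion(features, score_weights, score_bias, alpha_logit):
    """Blend the two gated fusions with alpha = sigmoid(alpha_logit)."""
    alpha = 1.0 / (1.0 + math.exp(-alpha_logit))
    w_ent = entropy_gate(features)
    w_imp = importance_gate(features, score_weights, score_bias)
    fused_ent = weighted_sum(w_ent, features)
    fused_imp = weighted_sum(w_imp, features)
    fused = [alpha * e + (1.0 - alpha) * i for e, i in zip(fused_ent, fused_imp)]
    return fused, w_ent, w_imp, alpha


# Toy features after unimodal encoding and cross-modal attention (made-up values).
text = [4.0, 0.0, 0.0, 0.0]    # sharply peaked: low entropy
audio = [0.1, 0.1, 0.1, 0.1]   # flat: maximal entropy, treated as least certain
video = [1.0, 0.5, 0.0, 0.2]

fused, w_ent, w_imp, alpha = dual_gated_fusion(
    [text, audio, video],
    score_weights=[0.5, 0.1, -0.2, 0.3],  # hypothetical parameters
    score_bias=0.0,
    alpha_logit=0.0,                      # sigmoid(0) gives an equal blend
)
print("entropy-gate weights:", w_ent)
print("importance-gate weights:", w_imp)
print("fused feature:", fused)
```

In this toy setup the flat audio feature receives the smallest entropy-gate weight, which mirrors the intended effect of down-weighting an uncertain modality. In a trained model the scorer parameters and the scalar behind $\alpha$ would be learned jointly with the rest of the network.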