Beyond Simple Fusion: Adaptive Gated Fusion for Robust Multimodal Sentiment Analysis
Han Wu, Yanming Sun, Yunhe Yang, Derek F. Wong
TL;DR
This paper addresses robust multimodal sentiment analysis (MSA), where simple fusion fails to handle noisy or conflicting modalities. It introduces the Adaptive Gated Fusion Network (AGFN), a dual-gated fusion mechanism that combines an Information Entropy Gate and a Modality Importance Gate, balanced by a learnable parameter, and uses cross-modal attention and virtual adversarial training (VAT) for robustness. AGFN achieves state-of-the-art performance on CMU-MOSI and competitive results on CMU-MOSEI, and empirical analyses indicate that it learns a more robust, generalized feature space, as evidenced by a reduced Prediction Space Correlation (PSC). By explicitly modeling information reliability and modality importance, the approach better suppresses noise and conflicts in real-world data.
Abstract
Multimodal sentiment analysis (MSA) fuses information from diverse modalities (e.g., text, audio, visual) to enhance sentiment prediction. However, simple fusion techniques often fail to account for variations in modality quality, such as modalities that are noisy, missing, or semantically conflicting. This oversight leads to suboptimal performance, especially in discerning subtle emotional nuances. To mitigate this limitation, we introduce a simple yet efficient Adaptive Gated Fusion Network (AGFN) that adaptively adjusts feature weights via a dual-gate fusion mechanism based on information entropy and modality importance. Applied after unimodal encoding and cross-modal interaction, this mechanism mitigates the influence of noisy modalities and prioritizes informative cues. Experiments on CMU-MOSI and CMU-MOSEI show that AGFN significantly outperforms strong baselines in accuracy, effectively discerning subtle emotions with robust performance. Visualization analysis of feature representations demonstrates that AGFN enhances generalization by learning from a broader feature distribution: it reduces the correlation between feature location and prediction error, thereby decreasing reliance on specific locations and yielding more robust multimodal feature representations.
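The abstract does not give the gate equations, so the following is only an illustrative sketch of how a dual-gate fusion of this kind might weight modalities. Everything specific here is an assumption rather than the authors' formulation: the function names, the use of softmax entropy as a reliability signal (lower entropy treated as more reliable), the dot-product importance score, and the balance parameter `lam` (standing in for the learnable parameter, fixed by hand here).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(features):
    """Shannon entropy of softmax(features), scaled to [0, 1]."""
    probs = softmax(features)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(features))


def entropy_gate(modalities):
    """Assumed entropy gate: lower-entropy modalities get larger weights."""
    names = list(modalities)
    scores = [1.0 - normalized_entropy(modalities[n]) for n in names]
    return dict(zip(names, softmax(scores)))


def importance_gate(modalities, importance_params):
    """Assumed importance gate: softmax over per-modality dot-product scores.

    importance_params holds one (hypothetical) parameter vector per modality;
    in a trained model these would be learned.
    """
    names = list(modalities)
    scores = [
        sum(w * x for w, x in zip(importance_params[n], modalities[n]))
        for n in names
    ]
    return dict(zip(names, softmax(scores)))


def gated_fusion(modalities, importance_params, lam):
    """Fuse modality features with a convex mix of the two gates.

    lam in [0, 1] balances the entropy gate (lam) against the
    importance gate (1 - lam). Returns (fused_vector, weights).
    """
    e_w = entropy_gate(modalities)
    i_w = importance_gate(modalities, importance_params)
    weights = {n: lam * e_w[n] + (1.0 - lam) * i_w[n] for n in modalities}
    dim = len(next(iter(modalities.values())))
    fused = [
        sum(weights[n] * modalities[n][d] for n in modalities)
        for d in range(dim)
    ]
    return fused, weights


# Toy features: "text" is sharply peaked, "audio" is flat (uninformative).
features = {
    "text": [4.0, 0.0, 0.0, 0.0],
    "audio": [0.1, 0.1, 0.1, 0.1],
    "visual": [1.0, 0.5, 0.0, 0.2],
}
params = {name: [0.1, 0.1, 0.1, 0.1] for name in features}
fused, weights = gated_fusion(features, params, lam=0.5)
```

Because each gate produces a distribution over modalities and `lam` forms a convex combination, the final weights also sum to one. In this toy input, the flat "audio" vector has maximal entropy and is therefore down-weighted by the assumed entropy gate relative to the peaked "text" vector.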
