Robust Multimodal Sentiment Analysis via Double Information Bottleneck
Huiting Huang, Tieliang Gong, Kai He, Jialun Wu, Erik Cambria, Mengling Feng
TL;DR
The paper introduces the Double Information Bottleneck (DIB) framework for robust multimodal sentiment analysis, which couples a low-rank Rényi entropy-based information bottleneck (LRIB) for unimodal representation learning with an attention bottleneck fusion for cross-modal integration. By replacing the traditional Shannon-entropy IB with LRIB, the method operates directly on sample-based kernel representations, improving robustness to noise and scalability in high dimensions. DIB jointly optimizes unimodal compression and cross-modal relevance, using the textual modality as a strong predictive anchor while drawing on cross-modal cues through a compact bottleneck fusion. Across CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single, DIB achieves state-of-the-art or competitive results and remains resilient to noise and missing modalities, highlighting its practical value for real-world multimodal systems. The work suggests extensions to broader multimodal tasks, with adaptive supervision and visual grounding as possible future improvements to robustness and generalization.
Abstract
Multimodal sentiment analysis has received significant attention across diverse research domains. Despite advances in algorithm design, existing approaches suffer from two critical limitations: insufficient learning from noise-contaminated unimodal data, which corrupts cross-modal interactions, and inadequate fusion of multimodal representations, which discards discriminative unimodal information while retaining redundant multimodal information. To address these challenges, this paper proposes a Double Information Bottleneck (DIB) strategy to obtain a powerful, unified, and compact multimodal representation. Implemented within the framework of the low-rank Rényi entropy functional, DIB offers greater robustness against diverse noise sources and better computational tractability for high-dimensional data than conventional Shannon entropy-based methods. DIB comprises two key modules: 1) learning a sufficient and compressed representation of each unimodal input by maximizing task-relevant information and discarding superfluous information, and 2) ensuring the discriminative ability of the multimodal representation through a novel attention bottleneck fusion mechanism. Consequently, DIB yields a multimodal representation that effectively filters out noisy information from unimodal data while capturing inter-modal complementarity. Extensive experiments on CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single validate the effectiveness of our method. The model achieves 47.4% accuracy under the Acc-7 metric on CMU-MOSI and 81.63% F1-score on CH-SIMS, outperforming the second-best baseline by 1.19%. Under noise, it shows only 0.36% and 0.29% performance degradation on CMU-MOSI and CMU-MOSEI, respectively.
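To make the "sample-based" idea concrete, the following minimal sketch shows how a matrix-based Rényi α-entropy and the corresponding mutual information can be estimated from Gram matrices, and how they combine into a generic information bottleneck objective. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the width `sigma`, the order `alpha`, the trade-off weight `beta`, and all function names are hypothetical choices, and the paper's low-rank approximation is not reproduced here (the sketch uses the full eigenvalue spectrum).

```python
import numpy as np


def gram_matrix(x, sigma=1.0):
    """Trace-normalized Gaussian Gram matrix of the rows of x.

    The kernel and its width `sigma` are assumptions made for illustration.
    """
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)


def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi alpha-entropy (in bits) of a trace-one PSD matrix."""
    eig = np.clip(np.linalg.eigvalsh(a), 0.0, None)  # guard tiny negatives
    return float(np.log2(np.sum(eig ** alpha)) / (1.0 - alpha))


def mutual_information(a, b, alpha=2.0):
    """I(A; B) = S(A) + S(B) - S(A, B), with the joint entropy taken from
    the trace-normalized Hadamard product of the two Gram matrices."""
    joint = a * b
    joint = joint / np.trace(joint)
    return (renyi_entropy(a, alpha) + renyi_entropy(b, alpha)
            - renyi_entropy(joint, alpha))


def ib_objective(x, z, y, beta=0.1, alpha=2.0):
    """Generic IB loss to minimize: beta * I(X; Z) - I(Z; Y).

    `beta` is a hypothetical compression/relevance trade-off weight.
    """
    kx, kz, ky = gram_matrix(x), gram_matrix(z), gram_matrix(y)
    return (beta * mutual_information(kx, kz, alpha)
            - mutual_information(kz, ky, alpha))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 16))            # stand-in unimodal features
    z = x[:, :4]                             # stand-in compressed representation
    y = np.sign(x[:, :1])                    # stand-in sentiment label
    print(ib_objective(x, z, y))
```

Because the estimator works on eigenvalues of Gram matrices, it needs no explicit density estimate of the representation. In a training setting the same quantities would be computed with a differentiable framework so that the loss can be back-propagated through the encoder.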
