Robust Multimodal Sentiment Analysis via Double Information Bottleneck

Huiting Huang, Tieliang Gong, Kai He, Jialun Wu, Erik Cambria, Mengling Feng

TL;DR

The paper introduces the Double Information Bottleneck (DIB) framework for robust multimodal sentiment analysis by coupling a low-rank Rényi entropy-based information bottleneck (LRIB) for unimodal representation learning with an attention bottleneck fusion for cross-modal integration. By replacing traditional Shannon-entropy IB with LRIB, the method directly operates on sample-based kernel representations, improving robustness to noise and scalability in high dimensions. DIB jointly optimizes unimodal compression and cross-modal relevance, selecting the textual modality as a strong predictive anchor while leveraging cross-modal cues via a compact bottleneck fusion. Across CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single, DIB achieves state-of-the-art or competitive results and demonstrates exceptional resilience to noise and missing modalities, highlighting its practical value for real-world multimodal systems. The work suggests promising extensions to broader multimodal tasks and future improvements such as adaptive supervision and visual grounding to further enhance robustness and generalization.

Abstract

Multimodal sentiment analysis has received significant attention across diverse research domains. Despite advancements in algorithm design, existing approaches suffer from two critical limitations: insufficient learning from noise-contaminated unimodal data, leading to corrupted cross-modal interactions, and inadequate fusion of multimodal representations, which discards discriminative unimodal information while retaining redundant multimodal information. To address these challenges, this paper proposes a Double Information Bottleneck (DIB) strategy to obtain a powerful, unified, and compact multimodal representation. Implemented within the framework of the low-rank Rényi's entropy functional, DIB offers enhanced robustness against diverse noise sources and computational tractability for high-dimensional data compared to conventional Shannon entropy-based methods. The DIB comprises two key modules: 1) learning a sufficient and compressed representation of individual unimodal data by maximizing the task-relevant information and discarding the superfluous information, and 2) ensuring the discriminative ability of the multimodal representation through a novel attention bottleneck fusion mechanism. Consequently, DIB yields a multimodal representation that effectively filters out noisy information from unimodal data while capturing inter-modal complementarity. Extensive experiments on CMU-MOSI, CMU-MOSEI, CH-SIMS, and MVSA-Single validate the effectiveness of our method. The model achieves 47.4% accuracy under the Acc-7 metric on CMU-MOSI and 81.63% F1-score on CH-SIMS, outperforming the second-best baseline by 1.19%. Under noise, it shows only 0.36% and 0.29% performance degradation on CMU-MOSI and CMU-MOSEI, respectively.

Paper Structure

This paper contains 32 sections, 1 theorem, 29 equations, 17 figures, 9 tables, 1 algorithm.

Key Result

Proposition 4.3

For any given $X$ and $Y$, the mapping $\mathrm{IB}^{k}_{\alpha}$ has the following properties:

Figures (17)

  • Figure 1: A visual-text pair example illustrating the unimodal contamination and cross-modal fusion problems: a) contaminated unimodal data includes redundancy (e.g. high similarity between consecutive frames), background noise, modality misalignment (e.g. objects mentioned in the transcript are not visible in the video) and missing data. b) the above contaminated unimodal data leads to corrupted and insufficient cross-modal interaction.
  • Figure 2: Comparison of a traditional entropy measure and low-rank Rényi's entropy. Darker colors represent key patterns of features, while lighter colors indicate irrelevant features. The low-rank constraint in the Rényi's entropy ensures that only a few principal patterns in the multimodal features are retained in the representation, capturing the most salient features while ignoring the irrelevant and noisy parts.
  • Figure 3: The architecture of the proposed DIB model. After feature extraction, LRIB-guided representation learning modules act as noise filters at both the unimodal and multimodal levels. In addition, attention bottleneck fusion sifts information to produce a unified and compact representation.
  • Figure 4: The empirical LRIB curve found by minimizing the LRIB Lagrangian of the DIB model on the CMU-MOSI dataset with varying $\beta$.
  • Figure 5: Attention bottleneck fusion module. The process enables iterative information flow, where cross-modal information is first aggregated into bottleneck embeddings, and then redistributed to enhance modality-specific representations.
  • ...and 12 more figures
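
The two-phase flow described in the Figure 5 caption (aggregate into bottleneck embeddings, then redistribute to each modality) can be illustrated with a minimal NumPy sketch. This is not the paper's module: it assumes single-head scaled dot-product attention with no learned projections, residual updates, and hypothetical names (`attend`, `bottleneck_fusion_step`) chosen only for illustration.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections
    (an illustrative simplification)."""
    dim = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(dim))
    return weights @ keys_values


def bottleneck_fusion_step(modalities, bottleneck):
    """One aggregate-then-redistribute step.

    modalities: list of (n_tokens_m, dim) arrays, one per modality.
    bottleneck: (n_bottleneck, dim) array of shared bottleneck embeddings.
    """
    # 1) Aggregate: the few bottleneck tokens read from all modality tokens.
    all_tokens = np.concatenate(modalities, axis=0)
    bottleneck = bottleneck + attend(bottleneck, all_tokens)
    # 2) Redistribute: each modality reads only from the bottleneck, so
    #    cross-modal information must pass through the narrow channel.
    updated = [tokens + attend(tokens, bottleneck) for tokens in modalities]
    return updated, bottleneck
```

Calling `bottleneck_fusion_step` repeatedly on its own outputs gives the iterative flow the caption mentions. Keeping the number of bottleneck tokens small relative to the modality sequence lengths is what makes the exchange compact.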

Theorems & Definitions (8)

  • Definition 4.1: Low Rank Rényi's Entropy
  • Definition 4.2: LRIB
  • Proposition 4.3
  • proof
  • proof
  • proof
  • proof
  • proof
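
To make the idea behind Definition 4.1 and Figure 2 more concrete, the sketch below computes a matrix-based Rényi entropy from a sample Gram matrix and a low-rank variant that keeps only the top-$k$ eigenvalues. The statement of Definition 4.1 is not reproduced on this page, so the tail handling here (spreading the leftover eigenvalue mass uniformly over the remaining $n-k$ directions) is an assumption and may differ from the paper. The Gaussian kernel, the bandwidth `sigma`, the default `alpha=2.0`, and all function names are likewise illustrative choices.

```python
import numpy as np


def gaussian_gram(x, sigma=1.0):
    """Trace-normalized Gaussian Gram matrix for the samples in the rows of x."""
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off
    gram = np.exp(-sq_dists / (2.0 * sigma**2))
    return gram / np.trace(gram)  # eigenvalues now sum to 1


def renyi_entropy(a, alpha=2.0):
    """Matrix-based Renyi entropy: log2(sum_i lambda_i^alpha) / (1 - alpha)."""
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return np.log2(np.sum(lam**alpha)) / (1.0 - alpha)


def low_rank_renyi_entropy(a, k, alpha=2.0):
    """Low-rank variant (assumed form): keep the k largest eigenvalues and
    replace the remaining n - k by their average."""
    n = a.shape[0]
    if k >= n:
        return renyi_entropy(a, alpha)
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)[::-1]  # descending order
    top = lam[:k]
    residual = max(1.0 - top.sum(), 0.0) / (n - k)
    total = np.sum(top**alpha) + (n - k) * residual**alpha
    return np.log2(total) / (1.0 - alpha)
```

For example, `low_rank_renyi_entropy(gaussian_gram(x), k=5)` uses only the five dominant spectral components of the sample kernel matrix, which is the sense in which small, noise-dominated eigenvalues are prevented from shaping the estimate. In practice the top-$k$ eigenvalues would be obtained with a partial eigensolver rather than the full decomposition used here for brevity.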