Learning Optimal Multimodal Information Bottleneck Representations
Qilong Wu, Yiyang Shao, Jun Wang, Xiaobo Sun
TL;DR
This work addresses two core issues that keep prior multimodal information bottleneck (MIB) methods from learning optimal representations: ad hoc regularization weights and imbalanced task-relevant information across modalities. The authors introduce Optimal Multimodal Information Bottleneck (OMIB), a framework that combines task-relevance branches with an optimal multimodal fusion block using cross-attention, trained under a variational objective that enforces sufficiency while minimizing redundancy. A key theoretical contribution is the derivation of a bound $M_u$ on the regularization weight $\beta$ (with extensions to multiple modalities) that guarantees achievability of the optimal MIB, together with a dynamic per-modality weight $r$ to handle information imbalance. Mutual information is estimated via MINE, and a variational upper bound is used for tractable optimization. Empirically, OMIB outperforms state-of-the-art MIB methods on synthetic data and on real-world tasks including emotion recognition (CREMA-D), multimodal sentiment analysis (CMU-MOSI), and anomalous tissue detection (eight 10x datasets), while ablation studies confirm the importance of warm-up, cross-attention fusion, and the dynamic weighting strategy. The work provides an information-theoretic foundation for principled multimodal fusion and demonstrates practical gains across diverse downstream tasks.
Abstract
Leveraging high-quality joint representations from multimodal data can greatly enhance model performance in various machine-learning-based applications. Recent multimodal learning methods, based on the multimodal information bottleneck (MIB) principle, aim to generate optimal MIB with maximal task-relevant information and minimal superfluous information via regularization. However, these methods often set ad hoc regularization weights and overlook imbalanced task-relevant information across modalities, limiting their ability to achieve optimal MIB. To address this gap, we propose a novel multimodal learning framework, Optimal Multimodal Information Bottleneck (OMIB), whose optimization objective guarantees the achievability of optimal MIB by setting the regularization weight within a theoretically derived bound. OMIB further addresses imbalanced task-relevant information by dynamically adjusting regularization weights per modality, promoting the inclusion of all task-relevant information. Moreover, we establish a solid information-theoretical foundation for OMIB's optimization and implement it under the variational approximation framework for computational efficiency. Finally, we empirically validate OMIB's theoretical properties on synthetic data and demonstrate its superiority over state-of-the-art benchmark methods in various downstream tasks.
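
To make the idea of a per-modality regularization weight concrete, here is a minimal sketch in the style of a generic variational information bottleneck loss. It is not OMIB's actual objective, which this summary does not spell out. It assumes each modality encoder outputs a diagonal Gaussian, uses the closed-form KL divergence to a standard normal as the variational upper bound on the compression term, and scales each modality's term by a weight `r` under a shared `beta`. The function names, the Gaussian assumption, and the numeric values are all hypothetical; in OMIB, `beta` would be chosen within the derived bound and `r` adjusted dynamically during training.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a single sample.

    Assumption: a diagonal-Gaussian encoder, as in generic variational IB.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def weighted_ib_loss(task_loss, kl_terms, weights, beta):
    """Task loss plus beta times the weighted sum of per-modality KL terms.

    `weights` plays the role of a per-modality weight r (hypothetical form).
    """
    if len(kl_terms) != len(weights):
        raise ValueError("need exactly one weight per modality")
    return task_loss + beta * sum(r * kl for r, kl in zip(weights, kl_terms))


# Two hypothetical modalities with 2-dimensional latent codes.
kl_a = gaussian_kl_to_standard_normal(mu=[1.0, 0.0], log_var=[0.0, 0.0])
kl_b = gaussian_kl_to_standard_normal(mu=[0.0, 0.0], log_var=[0.0, 0.0])

# Illustrative values only; neither beta nor the weights come from the paper.
loss = weighted_ib_loss(task_loss=1.0, kl_terms=[kl_a, kl_b],
                        weights=[2.0, 1.0], beta=0.1)
```

In this sketch, raising a modality's weight penalizes its latent code more strongly, and lowering it lets that modality retain more information. That is the lever a dynamic weighting scheme would adjust when task-relevant information is unevenly distributed across modalities.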
