InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions
Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan
TL;DR
InfMasking targets the difficulty of capturing synergistic information in multimodal representation learning. It stochastically masks most features of each modality during fusion and aligns the unmasked fused representation with the masked ones by maximizing mutual information. Because estimating mutual information over infinitely many masks is intractable, the method derives a tractable InfMasking loss that uses a Gaussian-based lower bound to approximate the expectation. On controlled synthetic data, InfMasking handles synergy, redundancy, and uniqueness well, and it achieves state-of-the-art performance on seven real-world multimodal benchmarks, including bimodal and trimodal setups. These results suggest broad applicability and motivate future theoretical work on synergistic information in multimodal learning.
Abstract
In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
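
The masking-and-alignment idea described above can be illustrated with a minimal sketch. This is not the released implementation and not the InfMasking loss itself: it averages a standard InfoNCE objective over a small, finite number of random masks, whereas the paper derives an approximation for the infinite-mask case. The function names (`mask_features`, `fuse`, `info_nce`, `masked_alignment_loss`), the concatenate-and-normalize fusion, and the default `keep_ratio`, `num_masks`, and `temperature` values are all assumptions made for illustration.

```python
import math
import random


def mask_features(x, keep_ratio, rng):
    """Zero each entry independently, keeping it with probability keep_ratio."""
    return [v if rng.random() < keep_ratio else 0.0 for v in x]


def fuse(modalities):
    """Toy fusion (assumption): concatenate modality features, then L2-normalize."""
    z = [v for m in modalities for v in m]
    norm = math.sqrt(sum(v * v for v in z)) or 1.0  # avoid dividing by zero
    return [v / norm for v in z]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE: anchors[i] should score highest against positives[i] in the batch."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(anchors[i], positives[j])) / temperature
            for j in range(n)
        ]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n


def masked_alignment_loss(batch, keep_ratio=0.3, num_masks=4, seed=0):
    """Average InfoNCE between unmasked and randomly masked fused representations.

    batch: list of samples; each sample is a list of per-modality feature vectors.
    A finite Monte Carlo stand-in for the infinite-mask expectation.
    """
    rng = random.Random(seed)
    unmasked = [fuse(sample) for sample in batch]
    total = 0.0
    for _ in range(num_masks):
        masked = [
            fuse([mask_features(m, keep_ratio, rng) for m in sample])
            for sample in batch
        ]
        total += info_nce(unmasked, masked)
    return total / num_masks


if __name__ == "__main__":
    # Three samples, each with two modalities of different dimensionality.
    batch = [
        [[1.0, 0.0, 2.0], [0.5, 1.5]],
        [[0.0, 3.0, 1.0], [2.0, 0.1]],
        [[2.0, 2.0, 0.0], [0.0, 1.0]],
    ]
    print(masked_alignment_loss(batch))
```

Minimizing this quantity maximizes an InfoNCE-style lower bound on the mutual information between the unmasked and masked fused representations. Increasing `num_masks` makes the average a closer estimate of the expectation over masks, at a cost that grows linearly in the number of masks, which is the expense the InfMasking loss is meant to avoid.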
