InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions

Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan

TL;DR

InfMasking targets synergistic information in multimodal representation learning: it stochastically masks large portions of each modality during fusion and aligns the masked and unmasked fused representations by maximizing their mutual information. Because the infinite-masking objective involves an intractable expectation, the method derives a tractable InfMasking loss that approximates it with a Gaussian-based lower bound. InfMasking achieves state-of-the-art performance on seven real-world multimodal benchmarks, covering bimodal and trimodal setups, and on controlled synthetic data it captures synergistic, redundant, and unique information. These results on both synthetic and real datasets suggest broad applicability and motivate future theoretical work on synergistic information in multimodal learning.

Abstract

In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
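The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive objective as the mutual information estimator and a finite number of masked views; it is not the paper's InfMasking loss, which approximates the infinite-mask case. The function names `info_nce` and `masked_alignment_loss` and the temperature value are hypothetical. For a batch of size N, InfoNCE gives the standard bound I(Z; Z_mask) >= log N - loss, so lowering the loss raises the bound.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale to unit length; guard against the zero vector.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] should match positives[i] against all other rows."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def masked_alignment_loss(z_full, z_masked_views, temperature=0.1):
    """Average InfoNCE between unmasked embeddings and each of k masked views.

    A finite-k stand-in (assumption) for the infinite-masking objective.
    """
    losses = [info_nce(z_full, z_mask, temperature) for z_mask in z_masked_views]
    return sum(losses) / len(losses)

# Toy batch of two unmasked fused embeddings and two hypothetical masked views.
z_full = [[1.0, 0.0], [0.0, 1.0]]
aligned_views = [z_full, z_full]                   # masked views match their samples
swapped_views = [[[0.0, 1.0], [1.0, 0.0]]]         # masked views match the wrong sample
aligned_loss = masked_alignment_loss(z_full, aligned_views)
swapped_loss = masked_alignment_loss(z_full, swapped_views)
```

In this toy batch the aligned views give a loss near zero, while the swapped views give a loss near 1 / temperature, which is the behavior the alignment objective relies on.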

Paper Structure

This paper contains 26 sections, 4 theorems, 9 equations, 3 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

When optimizing the function $f_\theta$ to maximize mutual information $I\left(Z_\theta; Z_\theta^{\prime}\right)$, and under the assumption that the network $f_\theta$ possesses sufficient expressivity, we observe that in the optimal parameter configuration: $I(Z_{\theta^{\star}}, Z'_{\theta^{\star

Figures (3)

  • Figure 1: The overall pipeline of InfMasking. Given $n$ modalities $X = (X_1, X_2, \ldots, X_n)$, we augment them to obtain $X'$ and $X''$, which are then encoded independently by modality-specific encoders to extract latent features. These features are processed in three ways: (1) All modality features are concatenated and input into a Transformer block, yielding fused features $Z'$ and $Z''$; (2) Each modality feature is individually input into a Transformer block, producing unimodal features $Z_1, Z_2, \ldots, Z_n$; (3) Features of each modality are randomly masked, concatenated, and input into a Transformer block, repeated $k$ times to obtain $Z^1_{\text{mask}}, Z^2_{\text{mask}}, \ldots, Z^k_{\text{mask}}$.
  • Figure 2: Synergy accuracy under different masking settings on the Trifeature datasets.
  • Figure 3: Visualization of the distribution of multimodal fusion embeddings and their masked counterparts.
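
Path (3) of the Figure 1 pipeline, masking each modality, concatenating, and fusing $k$ times, can be sketched as follows. This is a minimal illustration under stated assumptions: mean pooling stands in for the Transformer block, the token lists stand in for encoder outputs, and the names `mask_tokens`, `fuse`, and `masked_fused_views` and the mask ratio of 0.8 are hypothetical, not taken from the released code.

```python
import random

def mask_tokens(tokens, mask_ratio, rng):
    """Keep a random subset of a modality's tokens; always keep at least one."""
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept]

def fuse(token_lists):
    """Placeholder fusion: concatenate tokens across modalities, then mean-pool.

    The pipeline uses a Transformer block here; mean pooling is an assumption
    made only to keep the sketch self-contained.
    """
    tokens = [tok for modality in token_lists for tok in modality]
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]

def masked_fused_views(modalities, k, mask_ratio, seed=0):
    """Return k fused embeddings, each built from independently masked modalities."""
    rng = random.Random(seed)
    views = []
    for _ in range(k):
        masked = [mask_tokens(tokens, mask_ratio, rng) for tokens in modalities]
        views.append(fuse(masked))
    return views

# Two toy modalities with 10 and 6 tokens of dimension 2 (stand-ins for encoder outputs).
image_tokens = [[float(i), 1.0] for i in range(10)]
text_tokens = [[0.0, float(j)] for j in range(6)]

z_full = fuse([image_tokens, text_tokens])  # unmasked fused embedding
z_masks = masked_fused_views([image_tokens, text_tokens], k=4, mask_ratio=0.8)
```

Each of the `k` views sees a different small subset of every modality, so the masked embeddings vary from draw to draw while the unmasked embedding `z_full` stays fixed.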

Theorems & Definitions (5)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Proof 1 (for the lemma with label "lemma: ifif")
  • Lemma 4