What to align in multimodal contrastive learning?

Benoit Dufumier; Javiera Castillo-Navarro; Devis Tuia; Jean-Philippe Thiran

TL;DR

This work proposes CoMM, a Contrastive MultiModal learning method grounded in Partial Information Decomposition (PID), to learn unified representations that capture more than the redundant information shared across modalities. CoMM uses modality-specific encoders, latent converters, and a transformer fusion block to produce a single multimodal representation $Z_\theta$, and it optimizes a mutual-information-based loss via the InfoNCE estimator to capture redundancy $R$, uniqueness $U$, and synergy $S$, with $I(X_1,X_2;Y)=R+S+U_1+U_2$ and $I(Z'_{\theta^\top};Y)=I(X';Y)$. Empirically, CoMM achieves state-of-the-art results on seven real-world multimodal benchmarks with two and three modalities, and it models multimodal interactions well both in controlled settings and across diverse domains (robotics, healthcare, multimedia). The approach offers a versatile, task-agnostic framework for multimodal representation learning, with clear avenues for future work on augmentations and disentanglement.

Abstract

Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive MultiModal learning strategy that enables communication between modalities in a single multimodal space. Instead of imposing cross- or intra-modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM in both a controlled setting and a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities. In the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on seven multimodal benchmarks. Code is available at https://github.com/Duplums/CoMM
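The TL;DR states that the loss is optimized via the InfoNCE estimator. The block below is a minimal NumPy sketch of a symmetric InfoNCE loss between two batches of paired embeddings, together with the standard bound $I \geq \log N - \mathcal{L}_{\text{InfoNCE}}$ for batch size $N$. The function names, the temperature value, and the toy data are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss; row i of z_a and row i of z_b form a positive pair."""
    # Cosine similarity: L2-normalize each embedding first.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (N, N), positives on the diagonal

    def cross_entropy_on_diagonal(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def mi_lower_bound(loss, batch_size):
    """InfoNCE lower bound on mutual information, in nats."""
    return np.log(batch_size) - loss

# Toy data (assumption): one batch paired with a noisy copy, one with unrelated noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))
aligned = info_nce(z, z + 0.05 * rng.normal(size=z.shape))
shuffled = info_nce(z, rng.normal(size=z.shape))

print("aligned loss:", aligned, "bound:", mi_lower_bound(aligned, 64))
print("unrelated loss:", shuffled, "bound:", mi_lower_bound(shuffled, 64))
```

Because the loss is non-negative, the bound can never exceed $\log N$, so larger batches are needed to certify larger amounts of mutual information.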

Paper Structure (34 sections, 3 theorems, 11 equations, 7 figures, 8 tables, 1 algorithm)

Introduction
Quantifying multimodal interactions
CoMM: Contrastive Multimodal learning
Towards effective multimodal representations
Multimodal architecture
Training
Experiments
Controlled experiments on the bimodal Trifeature dataset
Experiments with 2 modalities on real-world datasets
Experiments with 3 modalities on real-world datasets
Ablation studies
Related work
Conclusions
Limitations and future research
Implementation details
...and 19 more sections

Key Result

Lemma 1

Under the multi-view redundancy assumption, cross-modal contrastive learning methods can learn only the redundant information $R$.
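To see why information beyond $R$ matters, consider a toy task that is an assumption of this sketch and not taken from the paper: $X_1$ and $X_2$ are independent fair bits and $Y = X_1 \oplus X_2$. Using the decomposition $I(X_1,X_2;Y)=R+S+U_1+U_2$ together with the standard PID consistency equations $I(X_i;Y)=R+U_i$ and non-negative PID terms, the code below computes the mutual informations exactly and shows that the whole task signal is synergy. It illustrates the motivation behind the lemma and is not its proof.

```python
import itertools
import numpy as np

def mutual_information(pairs):
    """I(A;B) in bits, from a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint, p_a, p_b = {}, {}, {}
    for a, b in pairs:
        joint[(a, b)] = joint.get((a, b), 0.0) + 1 / n
        p_a[a] = p_a.get(a, 0.0) + 1 / n
        p_b[b] = p_b.get(b, 0.0) + 1 / n
    return sum(p * np.log2(p / (p_a[a] * p_b[b])) for (a, b), p in joint.items())

# All four equally likely outcomes (x1, x2, y) with y = x1 XOR x2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in itertools.product([0, 1], repeat=2)]

i_x1_y = mutual_information([(x1, y) for x1, _, y in outcomes])
i_x2_y = mutual_information([(x2, y) for _, x2, y in outcomes])
i_x1_x2 = mutual_information([(x1, x2) for x1, x2, _ in outcomes])
i_joint_y = mutual_information([((x1, x2), y) for x1, x2, y in outcomes])

# I(X_i;Y) = R + U_i = 0 with non-negative terms forces R = U_1 = U_2 = 0.
redundancy = unique_1 = unique_2 = 0.0
synergy = i_joint_y - redundancy - unique_1 - unique_2

print("I(X1;Y) =", i_x1_y, " I(X2;Y) =", i_x2_y, " I(X1;X2) =", i_x1_x2)
print("I(X1,X2;Y) =", i_joint_y, " S =", synergy)
```

Here each modality alone carries no information about $Y$, and the two modalities share no information with each other, so an objective that only aligns $X_1$ with $X_2$ has nothing task-relevant to capture.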

Figures (7)

Figure 1: a) We propose CoMM, a contrastive multimodal approach that allows the interplay of multiple modalities and learns multimodal interactions. Unlike previous multimodal models (Cross) that align cross-modal features, CoMM aligns multimodal features in a shared representation space. b) Multimodal interactions are task-dependent, thus a model needs to capture all of them to generalize to any multimodal task. CoMM's new paradigm captures multimodal interactions beyond redundancy.
Figure 2: CoMM's model architecture. Inputs from different modalities $X=(X_1, ...,X_n)$ are first encoded by modality-specific encoders. Modality-specific features are processed by latent converters to map them into sequences of embeddings, which are concatenated and fused by a transformer block. The output is a single multimodal feature $\mathbf{Z_\theta}$.
Figure 3: CoMM training for $n=2$. Two multimodal augmentations are applied to $X$ to obtain $X'$ and $X''$. We also consider the projection operators to get $\{X_i\}_{i=1}^n$. These $n+2$ transformed versions of $X$ are processed by the network $f_\theta$, trained to maximize the agreement between these $n+2$ terms using contrastive objectives.
Figure 4: Linear probing accuracy of redundancy (shape), uniqueness (texture) and synergy (color and texture) on the bimodal Trifeature dataset. CoMM is the only model capturing all three task-related interactions between modalities.
Figure 5: Linear probing accuracy of redundancy $R$, uniqueness $U=\frac{1}{n}\sum_{i=1}^n U_i$ and synergy $S$ on bimodal Trifeature when optimizing each term separately in $\mathcal{L}_{\text{CoMM}}$. Minimizing $\mathcal{L}_i$ allows the model to learn $U_i$ and $R$, approximating $I(X_i; Y)$ for $i\in \{1,...,n\}$. Optimizing $\mathcal{L}=-\hat{I}(Z', Z'')$ allows the model to slowly learn $R$, $U_i$ and $S$. CoMM quickly captures all information.
...and 2 more figures
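
The data flow described in the Figure 2 caption (modality-specific encoders, latent converters, concatenation, transformer fusion, a single feature $Z_\theta$) can be sketched as a toy forward pass. Everything concrete here is an assumption for illustration: the sizes, the random untrained weights, the single self-attention layer with mean pooling standing in for the transformer block, and the `keep` argument, which drops modalities as a stand-in for the projection operators that yield $X_i$ in Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)
BATCH, TOKENS, DIM = 4, 3, 8   # hypothetical sizes
INPUT_DIMS = [10, 6]           # hypothetical raw feature sizes for X_1 and X_2

# Random, untrained stand-ins for the learned weights.
encoders = [rng.normal(size=(d, DIM)) / np.sqrt(d) for d in INPUT_DIMS]
converters = [rng.normal(size=(DIM, TOKENS * DIM)) / np.sqrt(DIM) for _ in INPUT_DIMS]
w_q, w_k, w_v = (rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(inputs, keep=None):
    """Map a list of per-modality batches to one multimodal feature Z.

    keep: optional set of modality indices; the others are dropped before
    fusion (an assumed stand-in for the projection that yields X_i).
    """
    sequences = []
    for i, x in enumerate(inputs):
        if keep is not None and i not in keep:
            continue
        feats = np.tanh(x @ encoders[i])                               # (B, DIM)
        tokens = (feats @ converters[i]).reshape(len(x), TOKENS, DIM)  # (B, T, DIM)
        sequences.append(tokens)
    seq = np.concatenate(sequences, axis=1)                            # (B, T_total, DIM)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(DIM))            # (B, T_total, T_total)
    return (attn @ v).mean(axis=1)                                     # (B, DIM)

x = [rng.normal(size=(BATCH, d)) for d in INPUT_DIMS]
z_full = fuse(x)                   # feature built from both modalities
z_only_first = fuse(x, keep={0})   # feature built from X_1 alone

print(z_full.shape, z_only_first.shape)
```

Because the fusion step operates on a variable-length token sequence, the same function produces a feature of identical shape whether it sees all modalities or a single one, which is what lets the $n+2$ views in Figure 3 share one network.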

Theorems & Definitions (7)

Definition 1: Multi-view redundancy
Lemma 1
Lemma 2
Lemma 3
Proof 1: proof of the result labeled "lemma: CLIP-redundancy"
Proof 2: proof of the result labeled "lemma: Ixxp"
Proof 3: proof of the result labeled "cor: Ix1z"
