Orthogonalized Multimodal Contrastive Learning with Asymmetric Masking for Structured Representations

Carolin Cissee, Raneen Younis, Zahra Ahmadi

TL;DR

This work tackles multimodal representation learning by decomposing information into redundant, unique, and synergistic components using Partial Information Decomposition. The proposed COrAL framework employs a dual-path encoder with orthogonality constraints to separate shared and modality-specific information, and integrates asymmetric masking to actively promote cross-modal synergy without extra branches. Empirical results on synthetic data and MultiBench benchmarks show that COrAL recovers modality-unique cues effectively, maintains redundant and synergistic information, and achieves state-of-the-art or competitive downstream performance with low variance across runs. The approach offers a principled, robust pathway toward richer, more interpretable multimodal embeddings and suggests promising directions for scaling to more modalities and higher-order interactions.

Abstract

Multimodal learning seeks to integrate information from heterogeneous sources, where signals may be shared across modalities, specific to individual modalities, or emerge only through their interaction. While self-supervised multimodal contrastive learning has achieved remarkable progress, most existing methods predominantly capture redundant cross-modal signals, often neglecting modality-specific (unique) and interaction-driven (synergistic) information. Recent extensions broaden this perspective, yet they either fail to explicitly model synergistic interactions or learn different information components in an entangled manner, leading to incomplete representations and potential information leakage. We introduce COrAL, a principled framework that explicitly and simultaneously preserves redundant, unique, and synergistic information within multimodal representations. COrAL employs a dual-path architecture with orthogonality constraints to disentangle shared and modality-specific features, ensuring a clean separation of information components. To promote synergy modeling, we introduce asymmetric masking with complementary view-specific patterns, compelling the model to infer cross-modal dependencies rather than rely solely on redundant cues. Extensive experiments on synthetic benchmarks and diverse MultiBench datasets demonstrate that COrAL consistently matches or outperforms state-of-the-art methods while exhibiting low performance variance across runs. These results indicate that explicitly modeling the full spectrum of multimodal information yields more stable, reliable, and comprehensive embeddings.
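
As a rough illustration of what "asymmetric masking with complementary view-specific patterns" could mean for a bimodal input, the plain-Python sketch below builds two views whose masks are exact complements of each other. The abstract does not state the mask granularity, the mask ratio, or how masked positions are filled, so the function names (`complementary_masks`, `asymmetric_views`), the default 50% ratio, and the element-wise zeroing of latent vectors are all assumptions made for illustration, not the authors' implementation.

```python
import random


def complementary_masks(dim, mask_ratio=0.5, seed=None):
    """Return two binary keep-masks over `dim` positions that are exact complements."""
    rng = random.Random(seed)
    n_masked = int(round(dim * mask_ratio))
    masked = set(rng.sample(range(dim), n_masked))
    mask_a = [0 if i in masked else 1 for i in range(dim)]
    mask_b = [1 - m for m in mask_a]  # keeps exactly what mask_a hides
    return mask_a, mask_b


def apply_mask(latent, mask):
    """Zero out the positions where the keep-mask is 0 (assumed fill value)."""
    return [v * m for v, m in zip(latent, mask)]


def asymmetric_views(z1, z2, mask_ratio=0.5, seed=None):
    """Build two views of a bimodal latent pair (z1, z2).

    View 1 keeps mask_a on modality 1 and mask_b on modality 2;
    view 2 swaps the masks, so within each view the two modalities
    expose complementary positions (hypothetical scheme).
    """
    if len(z1) != len(z2):
        raise ValueError("this sketch assumes equal latent sizes")
    mask_a, mask_b = complementary_masks(len(z1), mask_ratio, seed)
    view_1 = (apply_mask(z1, mask_a), apply_mask(z2, mask_b))
    view_2 = (apply_mask(z1, mask_b), apply_mask(z2, mask_a))
    return view_1, view_2
```

Under this assumed scheme, no single modality in either view carries the full latent, so matching the two views requires combining what is visible in one modality with what is visible in the other. With a ratio other than 0.5 the two views are masked at different rates, which is one possible reading of "asymmetric".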

Paper Structure (22 sections, 8 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: COrAL Architecture for multimodal input $X=(X_1,\ldots, X_n)$. Each modality is processed by a shared pathway $F^{SR}(\cdot)$ for an embedding $Z^{SR}$, which captures synergistic and redundant information, and by a modality-specific unique pathway $F^{U}(\cdot)$ to produce $Z^{U}_1,\ldots,Z^{U}_n$. At inference, neither augmentation nor masking is applied.
  • Figure 2: COrAL asymmetric masking strategy for two views $X^\prime$ and $X^{\prime\prime}$ of a bimodal input $X = (X_1, X_2)$ after encoding and latent conversion $E^{SR}_i(\cdot)$.
  • Figure 3: Schematic of the loss calculations between the embeddings of two views $X^\prime$ and $X^{\prime\prime}$ for bimodal input, as defined by the loss components of COrAL. $Z^{SR\prime}$ and $Z^{SR\prime\prime}$ are the embeddings containing information shared between modalities and $Z^{U\prime}_1$, $Z^{U\prime}_2$, $Z^{U\prime\prime}_1$, and $Z^{U\prime\prime}_2$ are the embeddings of modality-unique information. Arrows with $\text{CEL}(\cdot)$ indicate that cosine embedding loss is calculated between embeddings, and arrows with $\hat{I}_{\text{NCE}}(\cdot)$ refer to the InfoNCE estimator of mutual information.
  • Figure 4: Sensitivity analysis of COrAL: Impact of loss weight variations on linear probing accuracy across MultiBench datasets after 100 epochs. For each of the three loss components $\mathcal{L}_{\text{orthogonal}}$, $\mathcal{L}_{\text{shared}}$, and $\mathcal{L}_{\text{unique}}$ we plot the changes in accuracy for different values of their $\lambda$, which weigh their influence on the overall loss $\mathcal{L}_{\text{COrAL}}$.
  • Figure 5: Visualization of the density of the shared and unique representations of visual and textual information from the MOSI dataset after projection into the embedding space using UMAP. Contour lines for training data density are shown in lighter colors, and contour lines for test data density are shown in darker colors.
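
The captions of Figures 3 and 4 name a cosine embedding loss, an InfoNCE estimator, and three weighted terms $\mathcal{L}_{\text{orthogonal}}$, $\mathcal{L}_{\text{shared}}$, and $\mathcal{L}_{\text{unique}}$. The plain-Python sketch below shows one way such a weighted objective might be assembled. The captions do not say which embedding pairs enter each term, so the pairing used here (InfoNCE between the two views for the shared and unique embeddings, cosine embedding loss with target $-1$ between shared and unique embeddings), the temperature, the default weights, and the name `combined_loss` are assumptions rather than the paper's definition. It also assumes shared and unique embeddings have the same dimension.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two vectors given as lists of floats."""
    norm = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / max(norm, eps)


def cosine_embedding_loss(a, b, target, margin=0.0):
    """Per-pair cosine embedding loss (same form as torch.nn.CosineEmbeddingLoss).

    target=+1 pulls the pair together; target=-1 penalizes similarity above margin.
    """
    c = cosine(a, b)
    return 1.0 - c if target == 1 else max(0.0, c - margin)


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss over a batch: row i of view_a and row i of view_b are
    positives, all other rows of view_b act as negatives."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / n


def combined_loss(z_sr_a, z_sr_b, z_u_a, z_u_b,
                  lam_orth=1.0, lam_shared=1.0, lam_unique=1.0):
    """Weighted sum of three terms (assumed pairing, see lead-in).

    z_sr_*: batch of shared embeddings for each view.
    z_u_*:  list with one batch of unique embeddings per modality, for each view.
    """
    shared = info_nce(z_sr_a, z_sr_b)
    unique = sum(info_nce(a, b) for a, b in zip(z_u_a, z_u_b)) / len(z_u_a)
    # Assumed: discourage overlap between shared and unique embeddings of a view.
    pairs = [(s, u)
             for z_sr, z_u in ((z_sr_a, z_u_a), (z_sr_b, z_u_b))
             for batch in z_u
             for s, u in zip(z_sr, batch)]
    orth = sum(cosine_embedding_loss(s, u, -1) for s, u in pairs) / len(pairs)
    total = lam_orth * orth + lam_shared * shared + lam_unique * unique
    return {"orthogonal": orth, "shared": shared, "unique": unique, "total": total}
```

Note that a cosine embedding loss with target $-1$ and zero margin only penalizes positive similarity, so it tolerates anti-aligned embeddings; a strict orthogonality penalty would instead use something like the squared cosine. Which variant the paper uses is not recoverable from the captions.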

Theorems & Definitions (3)

  • Definition 1: Redundancy
  • Definition 2: Synergy
  • Definition 3: Uniqueness
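
The paper's own wording of these three definitions is not reproduced in this extract. For orientation only, the standard two-source Partial Information Decomposition relates the three quantities to ordinary mutual information as follows. The symbols $R$ (redundancy), $U_1$, $U_2$ (uniqueness), $S$ (synergy), and the task variable $Y$ are notation chosen here and may differ from the paper's.

```latex
\begin{align}
I(X_1, X_2; Y) &= R + U_1 + U_2 + S, \\
I(X_1; Y)      &= R + U_1, \\
I(X_2; Y)      &= R + U_2, \\
\text{hence}\quad S - R &= I(X_1, X_2; Y) - I(X_1; Y) - I(X_2; Y).
\end{align}
```

A familiar example of pure synergy is $Y = X_1 \oplus X_2$ with independent uniform bits: each modality alone gives $I(X_i; Y) = 0$, yet together they give $I(X_1, X_2; Y) = 1$ bit, so with non-negative components $R = U_1 = U_2 = 0$ and $S = 1$ bit.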