Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning

Can Yaras, Siyi Chen, Peng Wang, Qing Qu

TL;DR

This work tackles the modality gap observed in contrastive multimodal learning, particularly in CLIP, by analyzing the training dynamics under gradient flow. It shows that the gap persists and closes only at a slow rate of $\Omega(1/\log(t)^2)$, because the learned temperature is coupled with mismatched data pairs, and that a learnable temperature can therefore hinder gap closure. Building on these insights, it proposes three principled mitigation strategies (temperature scheduling, temperature reparameterization, and modality swapping) that reduce the gap and improve image-text retrieval, while noting that uniformity and other metrics govern performance on other tasks. The findings provide a theoretical framework and practical guidelines for designing multimodal representations with better cross-modal retrieval.

Abstract

Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language-Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms underlying multimodal learning are not yet well understood. Notably, these models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of the modality gap by characterizing the gradient flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training. Furthermore, our theoretical insights are validated through experiments on practical CLIP models. These findings provide principled guidance for mitigating the modality gap, including strategies such as appropriate temperature scheduling and modality swapping. Additionally, we demonstrate that closing the modality gap leads to improved performance on tasks such as image-text retrieval.
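To make the two quantities in the abstract concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss with an inverse temperature, together with one common way of measuring a modality gap. The function names (`clip_loss`, `centroid_gap`), the use of the distance between modality centroids as the gap measure, and the toy features are all assumptions made for illustration; the paper's own definition of $\Delta$ is given in its problem setup and may differ.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(img, txt, beta):
    """Symmetric contrastive (InfoNCE-style) loss.

    img, txt: lists of unit-norm feature vectors; pair i is (img[i], txt[i]).
    beta: inverse temperature (assumed here to be 1 / tau).
    """
    n = len(img)
    logits = [[beta * dot(img[i], txt[j]) for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def centroid_gap(img, txt):
    """Euclidean distance between the mean image and mean text feature.

    This is an illustrative gap measure, not necessarily the paper's Delta.
    """
    d = len(img[0])
    c_img = [sum(v[k] for v in img) / len(img) for k in range(d)]
    c_txt = [sum(v[k] for v in txt) / len(txt) for k in range(d)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))


# Toy example: two perfectly aligned pairs in two dimensions.
img = [[1.0, 0.0], [0.0, 1.0]]
txt = [[1.0, 0.0], [0.0, 1.0]]
loss_sharp = clip_loss(img, txt, beta=50.0)   # low temperature
loss_flat = clip_loss(img, txt, beta=0.0)     # infinitely high temperature
gap = centroid_gap(img, txt)
```

In this toy case the two modalities coincide, so the centroid gap is zero, and the loss depends only on how sharply the inverse temperature separates matched from unmatched pairs.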

Paper Structure

This paper contains 56 sections, 6 theorems, 51 equations, 7 figures, 5 tables.

Key Result

Lemma 3.1

Let $\nu(t)$ and $\gamma(t)$ be solutions to the gradient flow dynamics given in eq:grad_flow. Then we have for all $t\geq 0$, where $\beta'(\nu)$ denotes the derivative of $\beta$ with respect to $\nu$. Moreover, given $\beta(\nu)=\exp(\nu)$, we have $\beta'(\nu) = \beta(\nu)$ and the following holds:

Figures (7)

  • Figure 1: Stabilization and enlargement of the modality gap. We visualize the CLIP text-image embedding space via PCA, where image features are in red and text features are in blue. A line connects each image-text pair. A modality gap emerges between image and text features. (a) Even after long training, the gap between text and image persists. (b) When the two modalities are initialized with a small modality gap, training still enlarges the gap.
  • Figure 2: Dynamics of modality gap $\Delta$ and temperature $\tau$ (defined in the problem setup section) during training on synthetic data. Features from the two modalities are depicted in red and blue, with ground truth pairs connected by lines. At $t=4.0$, all pairs are successfully matched. Initially, modality gap increases due to significant mismatches between pairs, but it decreases as the level of mismatch diminishes. Notably, modality gap and temperature exhibit highly coupled dynamics throughout the learning process.
  • Figure 3: Parallel modalities with or without mismatched pairs. Ground truth pairs are connected by a dashed line.
  • Figure 4: Verifying theoretical results. We sample 2048 random pairs of MSCOCO examples and utilize a standard CLIP model to (a) plot $\Delta$ throughout training from scratch and (b) plot a histogram of the change in $\Delta$ across 100 initializations in the first 10 steps of training. More details regarding the experimental setup can be found in the experimental appendix.
  • Figure 5: We propose two categories of methods: Control Temperature and Swap Modality. (a) Our methods reduce the modality gap and may influence the uniformity of the feature space. (b) Control Temperature keeps the temperature at larger values, for example by increasing it over the course of training. (c) Swap Modality exchanges the image and text features within pairs.
  • ...and 2 more figures
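
The two method categories in the Figure 5 caption can be sketched roughly as follows. This is an illustrative sketch only: the linear schedule, the default values `tau_start` and `tau_end`, the per-pair swap probability `swap_prob`, and both function names are hypothetical choices, not the schedules or swapping rules used in the paper.

```python
import random


def linear_temperature(step, total_steps, tau_start=0.01, tau_end=0.1):
    """Temperature that increases linearly from tau_start to tau_end.

    The endpoints are placeholder values chosen for illustration.
    """
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)  # clamp to [0, 1]
    return tau_start + frac * (tau_end - tau_start)


def swap_modalities(img, txt, swap_prob, rng):
    """Exchange the image and text feature of each pair with probability swap_prob.

    Pairing is preserved: only which side of the pair a feature sits on changes.
    """
    side_a, side_b = [], []
    for u, v in zip(img, txt):
        if rng.random() < swap_prob:
            u, v = v, u
        side_a.append(u)
        side_b.append(v)
    return side_a, side_b


# Toy usage with placeholder features.
rng = random.Random(0)
img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txt = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
schedule = [linear_temperature(s, 10) for s in range(10)]
mixed_a, mixed_b = swap_modalities(img, txt, swap_prob=0.5, rng=rng)
```

The swapped batches would then be passed to the contrastive loss in place of the original image and text batches, with the scheduled temperature replacing a learnable one.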

Theorems & Definitions (12)

  • Lemma 3.1
  • Theorem 3.2
  • Theorem 3.3
  • proof
  • Lemma B.1
  • proof
  • proof: Proof of the theorem labeled thm:1
  • Lemma B.2
  • proof
  • Lemma B.3
  • ...and 2 more