
Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss

Abstract

Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we show that under certain conditions, minimizing the contrastive loss yields a representation in which the two modalities are separated by a global gap vector that is orthogonal to their embeddings. We also show that under these conditions the modality gap is monotonically related to robustness: decreasing the gap does not change the clean accuracy of the models but makes it less likely that a model will change its output when the embeddings are perturbed. Our experiments show that for many real-world VLMs we can significantly increase robustness by a simple post-processing step that moves one modality towards the mean of the other modality, without any loss of clean accuracy.
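The post-processing step described above (moving one modality toward the mean of the other) can be sketched in a few lines. The following is an illustrative toy example on synthetic clusters, not the paper's actual code; the data, dimensions, and `alpha` are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    """L2-normalize rows (CLIP-style embeddings live on the unit sphere)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy stand-ins for image/text embeddings: two tight clusters
# separated by a modality gap (hypothetical data).
img = normalize(rng.normal(scale=0.1, size=(100, 64)) + 1.0)
txt = normalize(rng.normal(scale=0.1, size=(100, 64)) - 1.0)

# Global gap vector between the empirical means of the two modalities.
gap = img.mean(axis=0) - txt.mean(axis=0)

# Post-processing sketch: shift the text embeddings toward the image
# mean by a fraction alpha of the gap, then renormalize.
# alpha = 1 closes the gap entirely; alpha = 0 leaves it unchanged.
alpha = 1.0
txt_shifted = normalize(txt + alpha * gap)
```

Because the shift is a single global translation (followed by renormalization), nearest-neighbor rankings within each modality are essentially preserved in the tight-cluster regime, which is why the paper can close the gap without hurting clean accuracy.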


Paper Structure

This paper contains 33 sections, 5 theorems, 52 equations, 19 figures, and 1 algorithm.

Key Result

Theorem 3.1

Let $\mu_x$ and $\mu_y$ be the modalities' empirical means. Assume that $\forall i: \left\lVert y_i-\mu_y\right\rVert \leq \epsilon, \left\lVert x_i-\mu_x\right\rVert \leq \epsilon$, and $\left\lVert\mu_x - \mu_y\right\rVert\gg\epsilon$. Then the gradient of the contrastive loss with respect to the
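The quantities in the theorem's assumptions can be measured directly from embedding matrices. A minimal sketch below checks whether a pair of embeddings is in the theorem's regime; the helper name is ours, not the paper's.

```python
import numpy as np

def gap_regime_stats(x, y):
    """Return (eps, gap_norm) for embedding matrices x, y of shape (n, d).

    eps bounds each modality's spread around its empirical mean
    (max over points of the distance to the mean), and gap_norm is
    the distance between the two means. The theorem's regime
    requires gap_norm >> eps.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    eps = max(np.linalg.norm(x - mu_x, axis=1).max(),
              np.linalg.norm(y - mu_y, axis=1).max())
    return eps, np.linalg.norm(mu_x - mu_y)
```

For two tight, well-separated clusters this returns a small `eps` and a large `gap_norm`, i.e. the setting in which the theorem applies.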

Figures (19)

  • Figure 1: Left: Projections of CLIP's embeddings of the MS-COCO validation set onto its first 3 principal components. A clear separation between images and texts is evident. Right: An image from the ImageNet validation set that is misclassified by CLIP when the caption template is changed. Multi-modal models can lose more than $6\%$ of their accuracy when the caption template is replaced.
  • Figure 2: Is the modality gap a bug or a feature? Changing the gap by moving the text embeddings by $\alpha\cdot\vec{g}$ has an inconsistent effect on downstream performance. The figure follows Liang et al. (2022) and shows that for some datasets models benefit from slightly enlarging the gap, while for others they benefit from maintaining it.
  • Figure 3: Three points in each modality in $\mathbb{R}^2$ and the corresponding multi-modal contrastive loss, along with the magnitude of the loss gradient. Lines connect true pairs. As long as the points satisfy relative alignment (the true pair of any point is also its nearest neighbor), the loss and the gradient magnitude are close to zero, even when a gap exists.
  • Figure 4: The evolution of embeddings under gradient descent on the contrastive loss, starting from two tight clusters. Both in an unnormalized embedding space (top) and when constraining the embeddings to the sphere (bottom), they converge to a solution that has almost zero loss and for which a global gap vector exists, orthogonal to both modalities. We prove that minimizing the contrastive loss leads to such a solution under certain assumptions. Full training details are given in the appendix.
  • Figure 5: (Top:) Two initial embeddings, where the bottom embedding is color-coded by $S^y_i$; $S^y_i$ decreases with distance to the other modality. (Bottom:) The training dynamics of a toy model initialized with isotropic Gaussians. Training starts by shrinking the variance in the direction of the gap, as our theory predicts.
  • ...and 14 more figures

Theorems & Definitions (10)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.4
  • Theorem 3.5
  • Lemma A.1
  • 5 proofs