It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

Abrar Fahim, Alex Murphy, Alona Fyshe

TL;DR

The paper reframes the observed separation between image and text embeddings in CLIP as a contrastive gap generated by the training objective rather than a modality-specific deficiency. It introduces explicit uniformity and alignment terms across and within modalities, yielding augmented losses $L_{\text{CUA}}$ and $L_{\text{CUAXU}}$ that distribute embeddings more evenly on the unit sphere and tighten positive image–text alignment. Empirically, these changes shrink the gap on MS COCO, maintain retrieval performance, and improve zero-shot transfer and multimodal arithmetic across several datasets, demonstrating practical benefits of improved representational geometry. The work highlights the importance of uniformity in high-dimensional multimodal spaces and suggests directions for scaling to larger data regimes and exploring additional embedding-space quality metrics.

Abstract

Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts in a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
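
To make the uniformity and alignment properties mentioned in the abstract concrete, the following is a minimal NumPy sketch. It assumes the commonly used unimodal definitions (alignment as the mean squared distance between positive pairs, uniformity as the log of the mean Gaussian potential between distinct points on the unit sphere). The function names, the temperature `t=2.0`, and the cross-modal variant are illustrative assumptions; the exact terms and weights used in $L_{\text{CUA}}$ and $L_{\text{CUAXU}}$ are not given in this excerpt.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit sphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def alignment(img, txt):
    """Mean squared distance between paired rows (lower = better aligned)."""
    return float(np.mean(np.sum((img - txt) ** 2, axis=1)))

def uniformity(x, t=2.0):
    """Log mean Gaussian potential over distinct pairs within one set.

    More negative values mean the points are spread more evenly.
    t=2.0 is an assumed default, not a value taken from the paper.
    """
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq[upper]))))

def cross_uniformity(img, txt, t=2.0):
    """Same potential, but between images and non-matching texts (assumed form)."""
    sq = np.sum((img[:, None, :] - txt[None, :, :]) ** 2, axis=-1)
    negatives = ~np.eye(len(img), dtype=bool)  # drop the positive pairs
    return float(np.log(np.mean(np.exp(-t * sq[negatives]))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = l2_normalize(rng.normal(size=(32, 8)))
    txt = l2_normalize(img + 0.1 * rng.normal(size=(32, 8)))
    print("alignment:", alignment(img, txt))
    print("image uniformity:", uniformity(img))
    print("cross-modal uniformity:", cross_uniformity(img, txt))
```

In a training setup these quantities would be computed on a batch of encoder outputs and added to the CLIP loss as extra terms; how they are weighted is a design choice this sketch leaves open.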

Paper Structure (25 sections, 5 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Visualizing the training stages of 3D CLIP on 1000 image-text pairs from MS COCO. Red points are image embeddings, and blue points are text embeddings. $I \rightarrow T$ accuracy represents the text retrieval accuracies: Higher $I \rightarrow T$ accuracies mean that positive pairs are well contrasted in the latent space relative to the negative pairs. The embeddings of each modality are initialized to reside in separate cones due to the cone effect, before they form arcs and then eventually merge together as rings. Finally, they spread out to fill the sphere.
  • Figure 2: Gap metrics on the MS COCO validation dataset. Recall that the gap is closed when linear separability is $\sim 0.5$ and centroid distance is small. The gap is much smaller when the uniformity and alignment terms are included.
  • Figure 2: Explained variances for all principal components of the 128D latent space for several losses.
  • Figure 3: Average zero-shot classification accuracies for CLIP fine-tuned with the different losses. CLIP losses with added uniformity and alignment terms consistently achieve higher zero-shot accuracies than default fine-tuned CLIP at the same dimensionality.
  • Figure 4: PCA explained variances for each CLIP dimensionality after fine-tuning.
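
The diagnostics named in the captions above (centroid distance, linear separability, and PCA explained variance) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and a least-squares linear classifier with an 80/20 split stands in for whatever classifier and protocol the paper actually uses, which this excerpt does not specify.

```python
import numpy as np

def centroid_distance(img, txt):
    """Euclidean distance between the mean image and mean text embedding."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def linear_separability(img, txt, train_frac=0.8, seed=0):
    """Held-out accuracy of a linear classifier telling images from texts.

    Values near 1.0 mean the two sets are separable (a gap exists);
    values near 0.5 mean they are mixed. The least-squares classifier
    and the split are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.vstack([img, txt])
    y = np.concatenate([np.ones(len(img)), -np.ones(len(txt))])
    xb = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    perm = rng.permutation(len(x))
    n_train = int(train_frac * len(x))
    train, test = perm[:n_train], perm[n_train:]
    w, *_ = np.linalg.lstsq(xb[train], y[train], rcond=None)
    pred = np.sign(xb[test] @ w)
    return float(np.mean(pred == y[test]))

def explained_variance_ratio(x):
    """Fraction of variance along each principal component, largest first."""
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    img = rng.normal(size=(200, 16))
    txt = rng.normal(size=(200, 16)) + 3.0  # shifted set: an artificial gap
    print("centroid distance:", centroid_distance(img, txt))
    print("linear separability:", linear_separability(img, txt))
    print("top-3 explained variance:", explained_variance_ratio(np.vstack([img, txt]))[:3])
```

A flat explained-variance profile suggests that embeddings use many directions of the latent space, while a profile dominated by a few components suggests they occupy only a small portion of it.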