
Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

Jonas Herzog, Yue Wang

Abstract

Recent research has suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that the supposed degrees of freedom for image embedding distances do not exist. For the empirical measures, our findings reveal that they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates that the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks of retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key to the best results.

Paper Structure (18 sections, 12 equations, 10 figures, 3 tables)


Figures (10)

  • Figure 1: Previous work illustrated an intra-modal misalignment in CLIP space by showing there are cat images closer to a dog ($d_{\neq}$) than to another cat ($d_{=}$). We argue $d_{\neq} < d_=$ is no sign of misalignment. For a labeled downstream dataset, intra-class variance of open vocabulary models is expected and desired to capture semantics and style beyond the narrow dataset-specific labels. Classifying and retrieving with frozen CLIP image embeddings still works well when similarities are measured along the dataset-specific semantic axes. Here the horizontal axis captures dog/cat.
  • Figure 2: Pairwise cosine similarity distributions. Left: Similarities between same class (blue) and opposite class (orange) image feature pairs. A high overlap ratio between the two colors was previously highlighted as an indicator for an intra-modal misalignment issue in CLIP. Right: Similarity distributions of image-text pairs (purple) versus image-image pairs (green). Because CLIP is only supervised on the former, the divergence has previously prompted concerns about whether the latter reflect true similarities. CLIP ViT-B/16. Dataset as in \ref{tab:catdog_retrieval}.
  • Figure 3: Motivated by the intra-modal misalignment hypothesis, previous work \cite{ctg} posited that it is necessary to convert image-image comparison (left) into image-text comparison (right).
  • Figure 4: (a)-(c): The previous intra-modal degree-of-freedom argument in \cite{susx} illustrates that two image embeddings can be either close together (a) or far apart (b) while having the same text-image distance $r$, concluding that two image embeddings can lie on any two arbitrary points on the circumference (c), leaving a degree of freedom for image-image miscalibration. Our interpretation (d)-(f): The previous line of argumentation overlooks that each image embedding is bound to more than one text anchor (f). Moreover, the two different configurations in (a) and (b) are not arbitrary, but have a good reason to exist: Images in (a,d) and (b,e) have equal distance $r$ to the "cat" text, but the two images in (a,d) are much more similar to each other than those in (b,e). Displayed distance values are real measurements.
  • Figure 5: Cosine similarity histograms by class (left) and by modality (right). The distributions are almost identical for purely text-image trained SigLIP (first row) and for SigLIP2 (second row), which includes an image-image self-supervised objective as in the DINO line of work. This indicates that intra-class variation (left) and the gap between text and image embeddings (right) are not signs of misalignment caused by pure text-image training, but rather normal behavior. All models are ViT-B. Embeddings are sampled from the ImageNet validation set.
  • ...and 5 more figures
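
The measurements described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It is not the authors' code and does not use real CLIP embeddings: the data are synthetic, the function names `pairwise_cosine_by_class` and `histogram_overlap` are hypothetical, and the histogram overlap coefficient is only one plausible reading of the "overlap ratio" mentioned in Figure 2. The toy embeddings have a single class axis (cat/dog) with small variance and many other "style" axes with larger variance, so same-class and different-class similarity histograms overlap heavily even though a projection onto the class axis separates the classes well.

```python
import numpy as np


def pairwise_cosine_by_class(embeddings, labels):
    """Return cosine similarities of same-class and different-class pairs."""
    x = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    sims = np.clip(x @ x.T, -1.0, 1.0)                # guard against rounding
    iu = np.triu_indices(len(y), k=1)                 # each pair once, no self-pairs
    same = (y[:, None] == y[None, :])[iu]
    return sims[iu][same], sims[iu][~same]


def histogram_overlap(a, b, bins=50):
    """Overlap coefficient of two normalized histograms on [-1, 1].

    0 means disjoint histograms, 1 means identical histograms.
    """
    edges = np.linspace(-1.0, 1.0, bins + 1)
    ha, _ = np.histogram(a, bins=edges)
    hb, _ = np.histogram(b, bins=edges)
    return float(np.minimum(ha / ha.sum(), hb / hb.sum()).sum())


# Synthetic stand-in for image embeddings (assumption: not CLIP features).
rng = np.random.default_rng(0)
n, dim = 200, 32
labels = np.repeat([0, 1], n // 2)                    # 0 = "cat", 1 = "dog"

class_axis = np.zeros(dim)
class_axis[0] = 1.0                                   # dataset-specific semantic axis

scales = np.ones(dim)
scales[0] = 0.3                                       # little noise along the class axis
style = rng.normal(size=(n, dim)) * scales            # large variance on other axes
signal = np.where(labels == 0, -1.0, 1.0)[:, None] * class_axis
emb = signal + style

same, diff = pairwise_cosine_by_class(emb, labels)
overlap = histogram_overlap(same, diff)

# Classify by the sign of the projection onto the class axis only.
proj = emb @ class_axis
accuracy = float(np.mean((proj > 0) == (labels == 1)))

print(f"histogram overlap (same vs. different class): {overlap:.2f}")
print(f"accuracy along the class axis: {accuracy:.2f}")
```

In this toy setting a high overlap and a high accuracy coexist, which mirrors the argument in Figure 1 that $d_{\neq} < d_{=}$ for some pairs does not by itself indicate misalignment. With real models, `emb` would be replaced by frozen image embeddings and `class_axis` by a direction derived from the downstream labels; how that direction is obtained is left open here.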