Is CLIP ideal? No. Can we fix it? Yes!

Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona

TL;DR

The paper analyzes the geometry of CLIP's joint image–text latent space and proves that no CLIP-like space can simultaneously satisfy all essential semantic properties (content description, attribute binding, spatial relations, and negation). To address this, it introduces Dense Cosine Similarity Maps (DCSMs) with Functional Rows, which retain topological information from patch and token embeddings and provide a richer similarity score. Empirical results across multiple benchmarks show that DCSMs consistently outperform CLIP-like baselines, and experiments on natural-language generalization via LLMs indicate potential for open-vocabulary expansion. The work offers a principled critique of CLIP's geometry and a lightweight, interpretable augmentation that improves multimodal reasoning without full retraining; future work aims to scale the method and extend its natural-language coverage.

Abstract

Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP

Paper Structure

This paper contains 31 sections, 5 theorems, 56 equations, 11 figures, 9 tables.

Key Result

Lemma 1

Embeddings of images or texts with two object concepts must be a linear superposition of the respective single object concept embeddings.
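
One hedged way to write the lemma symbolically is sketched below. The notation is an assumption made here for illustration and is not taken from the paper: $e(\cdot)$ stands for the embedding in the latent space produced by either encoder for an image or text depicting the listed concepts, $\mathbb{V}$ is the set of Object Concepts from Figure 2, and $\alpha, \beta$ are unspecified scalar weights.

```latex
e(v_1, v_2) = \alpha\, e(v_1) + \beta\, e(v_2),
\qquad v_1, v_2 \in \mathbb{V},\quad \alpha, \beta \in \mathbb{R}
```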

Figures (11)

  • Figure 1: CLIP scores do not accurately reflect semantics of text prompts due to inherent geometric limitations. For five out of six pairs of captions, the incorrect pair has the higher CLIP score with the image. By contrast, our proposed model using Dense Cosine Similarity Maps (DCSM) correctly scores matched pairs. The similarity score is unnormalized because it is predicted by a neural network. CLIP scores computed with OpenAI-CLIP ViT-B/32.
  • Figure 2: Graphical illustration of defined concept sets. Humans can parse visual stimuli from the real world and organize them into Object Concepts $\mathbb{V}$, Attributes which adorn objects $\mathbb{A}$, and Relationships between objects $\mathbb{G}$. These concepts can be ordered into Composed Scenes $\mathbb{S}$. Here, $\mathbb{V},\mathbb{A}, \mathbb{G}, \text{ and }\mathbb{S}$ are strict subsets of the set of all real world concepts. We can communicate these composed concepts via language $\mathbb{T}$, or by taking pictures of exemplars $\mathbb{I}$. These composed language or image modalities can be projected onto the CLIP latent space C via a text encoder t or an image encoder i. Elements in distinct sets with the same color have a one-to-one correspondence.
  • Figure 3: Empirical Dense Cosine Similarity Maps. Each one of four matrices shows a DCSM between a different sentence and the same pictured image. For each subfigure: the y axis shows the text tokens and the x axis varies by image patch. We cluster the patches by region as shown in the image. Each pixel value is the cosine similarity score between that token and patch embedding. Green sentences correctly represent the image and red ones do not.
  • Figure 4: Schematics of our proposed pipeline and training of its scoring function. Every sample seen during training contains one hard positive caption and image pair, and a hard negative caption and image pair. Images and texts are passed through frozen CLIP encoders to compute the DCSMs, and then functional rows (FRs) for compositional words are inserted before the DCSMs are scored.
  • Figure 5: Graphic illustration of retained topology in DCSMs. The image prompt has 4 patches, and each DCSM shows the dense cosine similarity between each of its patches and the text tokens of the text prompts. Sentences under the green checkmark are a correct pair with the image, while those with a red X are incorrect.
  • ...and 6 more figures
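
The Figure 3 caption describes each DCSM entry as the cosine similarity between one text token embedding and one image patch embedding. The following is a minimal NumPy sketch of that computation, not the authors' implementation. The function name, array shapes, and random stand-in embeddings are hypothetical; in practice the inputs would come from frozen CLIP encoders, and the functional rows and learned scoring network from Figure 4 are not shown.

```python
import numpy as np


def dense_cosine_similarity_map(token_emb, patch_emb, eps=1e-8):
    """Return a (num_tokens, num_patches) matrix of cosine similarities.

    token_emb: array of shape (num_tokens, dim), one row per text token.
    patch_emb: array of shape (num_patches, dim), one row per image patch.
    """
    token_emb = np.asarray(token_emb, dtype=np.float64)
    patch_emb = np.asarray(patch_emb, dtype=np.float64)

    # L2-normalize each row; eps guards against division by zero.
    token_norm = np.linalg.norm(token_emb, axis=1, keepdims=True)
    patch_norm = np.linalg.norm(patch_emb, axis=1, keepdims=True)
    tokens_unit = token_emb / np.maximum(token_norm, eps)
    patches_unit = patch_emb / np.maximum(patch_norm, eps)

    # Rows index text tokens, columns index image patches (as in Figure 3).
    return tokens_unit @ patches_unit.T


# Hypothetical stand-ins for CLIP outputs: 5 tokens, 49 patches, 512 dims.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 512))
patches = rng.normal(size=(49, 512))

dcsm = dense_cosine_similarity_map(tokens, patches)
print(dcsm.shape)
```

Unlike a single CLIP score computed from one pooled image embedding and one pooled text embedding, this matrix keeps one value per token and patch pair, which is the retained topology that the captions for Figures 3 and 5 refer to.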

Theorems & Definitions (15)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 5 more