Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona
TL;DR
The paper analyzes the geometry of CLIP's joint image–text latent space and proves that no CLIP-like joint embedding space can correctly handle any two of the following at the same time: basic content description, attribute binding, spatial relations, and negation. To address this, it introduces Dense Cosine Similarity Maps (DCSMs) with Functional Rows, which score an image–text pair from the patch and token embeddings rather than from a single pooled vector per modality, so the topological information in those embeddings is retained. Across multiple benchmarks, DCSMs consistently outperform CLIP-like baselines, and experiments that use LLMs to generalize to natural language suggest the approach could extend to an open vocabulary. The work offers a principled critique of CLIP's geometry and a lightweight, interpretable augmentation that improves multimodal reasoning without full retraining. Future work aims to scale the method and broaden its natural language coverage.
Abstract
Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
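To make the core idea concrete, here is a minimal sketch of a dense cosine similarity map: instead of one cosine score between a pooled image vector and a pooled text vector, every text token is compared against every image patch, giving a tokens-by-patches matrix. This is an illustration only, not the authors' implementation (see the repository linked above for that). The function names and the toy 2-D embeddings are assumptions made for the example, and Functional Rows and the downstream scoring step are omitted.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """CLIP-style score: a single cosine similarity between two pooled vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def dense_cosine_similarity_map(token_embs, patch_embs):
    """Return a len(token_embs) x len(patch_embs) matrix of cosine similarities.

    Row i holds the similarity of text token i to every image patch, so the
    layout of tokens and patches is preserved instead of being pooled away.
    """
    tokens = [normalize(t) for t in token_embs]
    patches = [normalize(p) for p in patch_embs]
    return [[sum(x * y for x, y in zip(t, p)) for p in patches] for t in tokens]


# Toy embeddings (hypothetical): 2 text tokens, 3 image patches, dimension 2.
token_embs = [[1.0, 0.0], [0.0, 1.0]]
patch_embs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

dcsm = dense_cosine_similarity_map(token_embs, patch_embs)
for row in dcsm:
    print([round(value, 3) for value in row])
```

In this toy case the first token matches the first patch exactly, the second token matches the second patch, and both are partially similar to the third, which is information a single pooled score cannot express.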
