The Double-Ellipsoid Geometry of CLIP
Meir Yossef Levi, Guy Gilboa
TL;DR
The paper analyzes CLIP's raw embeddings, before normalization, and reveals a double-ellipsoid geometry: images and text sit on separate ellipsoids that are offset from the origin, with mass concentrated in a thin shell around radius $\sqrt{n}$. It introduces conformity, a sample's average cosine similarity to all other samples, and shows that it can be estimated quickly by the cosine similarity to the mean vector. Conformity is linked to semantic commonality and is used to explain the modality gap. By decomposing the NT-Xent loss into alignment and uniformity terms, the paper explains why ellipsoids that are not centered at the origin suit the contrastive objective and how false negatives relate to uncertainty. It further demonstrates practical uses, including a training-free interpolation method (vSLERP) for semantic editing and a conformity-based lens for evaluating the expressiveness of generative methods, with implications for understanding multimodal latent spaces and guiding downstream tasks.
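The alignment/uniformity split mentioned above can be sketched as follows, assuming the standard NT-Xent form for a single image-to-text direction. The symbols $v_i$ (image embedding), $t_j$ (text embedding), $\tau$ (temperature), and $N$ (batch size) are notation introduced here for illustration and may differ from the paper's.

$$
\mathcal{L}_i
= -\log \frac{\exp\left(\cos(v_i, t_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\cos(v_i, t_j)/\tau\right)}
= \underbrace{-\frac{\cos(v_i, t_i)}{\tau}}_{\text{alignment}}
+ \underbrace{\log \sum_{j=1}^{N} \exp\left(\cos(v_i, t_j)/\tau\right)}_{\text{uniformity}}
$$

The first term rewards high similarity within the matched pair. The second penalizes similarity to every candidate in the batch, so a non-paired sample that is semantically close to $v_i$ (a false negative) is still pushed away.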
Abstract
Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications across a large variety of domains. We investigate the geometry of this embedding, which is still not well understood. We examine the raw unnormalized embedding and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of this structure, which allows instances to be embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to all other instances within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
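To make the conformity definition concrete, here is a minimal NumPy sketch. It is not the authors' code: the function names `conformity_exact` and `conformity_estimate` are hypothetical, and the data is a synthetic, off-origin anisotropic Gaussian cloud standing in for the raw embeddings of one modality. It computes the average cosine similarity of each sample to all others, then the cheaper cosine similarity to the mean vector, and compares the two by Pearson correlation.

```python
import numpy as np

def conformity_exact(X):
    """Average cosine similarity of each row of X to every other row."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    S = U @ U.T                                       # pairwise cosine similarities
    n = X.shape[0]
    # Each diagonal entry is the self-similarity (1.0); drop it from the average.
    return (S.sum(axis=1) - 1.0) / (n - 1)

def conformity_estimate(X):
    """Cosine similarity of each row of X to the mean vector of X."""
    mean = X.mean(axis=0)
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    return U @ (mean / np.linalg.norm(mean))

# Synthetic stand-in for one modality (assumption, not CLIP data):
# an anisotropic Gaussian cloud whose mean is far from the origin.
rng = np.random.default_rng(0)
dim = 64
offset = 2.0 * rng.normal(size=dim)          # non-zero mean vector
scales = rng.uniform(0.5, 1.5, size=dim)     # per-axis spread (ellipsoid axes)
X = offset + rng.normal(size=(500, dim)) * scales

exact = conformity_exact(X)      # O(n^2) in the number of samples
approx = conformity_estimate(X)  # O(n) in the number of samples
corr = np.corrcoef(exact, approx)[0, 1]
print(f"Pearson correlation between exact and estimated conformity: {corr:.4f}")
```

The two quantities need not agree in absolute scale, which is why the comparison here uses correlation rather than a difference. The estimate avoids the quadratic pairwise computation, which matters once the representative set is large.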
