The Double-Ellipsoid Geometry of CLIP

Meir Yossef Levi, Guy Gilboa

TL;DR

The paper analyzes CLIP's pre-normalized embeddings and reveals a double-ellipsoid geometry where images and text sit on separate, offset ellipsoids with thin-shell mass concentration around radius $\sqrt{n}$. It introduces conformity, a measure of a sample's average cosine similarity to others, and shows a fast estimator based on cosine to the mean vector, linking conformity to semantic commonality and explaining the modality gap. By decomposing the NT-Xent loss into alignment and uniformity, it explains why non-origin-centered ellipsoids optimize the contrastive objective and how false negatives relate to uncertainty. The work further demonstrates practical uses, including a training-free interpolation method (vSLERP) for semantic editing and a conformity-based lens for evaluating generative method expressiveness, with broad implications for understanding multimodal latent spaces and guiding downstream tasks.
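The alignment/uniformity split mentioned above can be written out for one direction of the contrastive loss. This is a sketch under assumed notation that does not come from the source: $u_i$ and $v_i$ are the normalized image and text embeddings of pair $i$ in a batch of size $N$, and $\tau$ is the temperature. The paper's exact formulation may differ.

$$
\mathcal{L}_{\text{img}\to\text{txt}}
= -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(u_i^\top v_i/\tau)}{\sum_{j=1}^{N}\exp(u_i^\top v_j/\tau)}
= \underbrace{-\frac{1}{N}\sum_{i=1}^{N}\frac{u_i^\top v_i}{\tau}}_{\text{alignment}}
+ \underbrace{\frac{1}{N}\sum_{i=1}^{N}\log\sum_{j=1}^{N}\exp\!\left(\frac{u_i^\top v_j}{\tau}\right)}_{\text{uniformity}}
$$

The first term rewards high similarity for matched pairs. The second penalizes high similarity to every candidate in the batch, which is where false negatives (unmatched pairs that are semantically close) enter.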

Abstract

Contrastive Language-Image Pre-Training (CLIP) is highly instrumental in machine learning applications within a large variety of domains. We investigate the geometry of this embedding, which is still not well understood. We examine the raw unnormalized embedding and show that text and image reside on linearly separable ellipsoid shells, not centered at the origin. We explain the benefits of this structure, which allows instances to be embedded according to their uncertainty during contrastive training. Frequent concepts in the dataset yield more false negatives, inducing greater uncertainty. A new notion of conformity is introduced, which measures the average cosine similarity of an instance to any other instance within a representative data set. We show this measure can be accurately estimated by simply computing the cosine similarity to the modality mean vector. Furthermore, we find that CLIP's modality gap optimizes the matching of the conformity distributions of image and text.
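The two conformity measures described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the function names `conformity` and `estimated_conformity` are hypothetical, and the data is synthetic (an anisotropic Gaussian cloud shifted away from the origin) rather than real CLIP embeddings.

```python
import numpy as np

def conformity(x):
    """Average cosine similarity of each row to every *other* row."""
    u = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize rows
    sims = u @ u.T                                    # pairwise cosine similarities
    n = len(u)
    # Each diagonal entry is 1 (self-similarity); exclude it from the average.
    return (sims.sum(axis=1) - 1.0) / (n - 1)

def estimated_conformity(x):
    """Cosine similarity of each row to the mean vector of the set."""
    m = x.mean(axis=0)
    u = x / np.linalg.norm(x, axis=1, keepdims=True)
    return u @ (m / np.linalg.norm(m))

# Synthetic stand-in for one modality (NOT real CLIP features):
# a cloud with per-feature scales, offset from the origin.
rng = np.random.default_rng(0)
n_samples, dim = 500, 64
offset = rng.normal(size=dim)                 # non-zero mean vector
scales = rng.uniform(0.5, 2.0, size=dim)      # unequal axes -> ellipsoid
x = offset + rng.normal(size=(n_samples, dim)) * scales

c = conformity(x)
c_hat = estimated_conformity(x)
corr = np.corrcoef(c, c_hat)[0, 1]
print(f"Pearson correlation between exact and estimated conformity: {corr:.4f}")
```

The exact measure costs a full pairwise similarity matrix, while the estimate needs only one dot product per sample. On this toy data the two track each other closely because the cloud is not centered at the origin; how well this holds on real embeddings is what the paper evaluates.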

Paper Structure

This paper contains 22 sections, 2 theorems, 25 equations, 24 figures.

Key Result

Theorem 3.1

Let the thin shell parameter be defined by $\sigma_n^2 = \sup_X \mathbb{E}\left(\|X\| - \sqrt{n}\right)^2$, where the supremum is over isotropic, log-concave random vectors $X$ in $\mathbb{R}^n$. Then $\sigma_n \le c (\log n)^\alpha$, where $c$ is a universal constant.
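The thin-shell behavior can be checked numerically for one simple case. The sketch below assumes a standard Gaussian, which is isotropic (zero mean, identity covariance) and log-concave, and measures the root-mean-square deviation of the norm from $\sqrt{n}$. The sample size and dimensions are arbitrary choices, and this illustrates a single distribution rather than the supremum in the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
rms_dev = {}

for n in (16, 256, 2048):
    # Standard Gaussian samples: isotropic and log-concave.
    x = rng.standard_normal(size=(n_samples, n))
    norms = np.linalg.norm(x, axis=1)
    # Root-mean-square deviation of the norm from sqrt(n).
    rms_dev[n] = np.sqrt(np.mean((norms - np.sqrt(n)) ** 2))
    print(f"n={n:5d}  sqrt(n)={np.sqrt(n):7.2f}  "
          f"RMS deviation={rms_dev[n]:.3f}  "
          f"relative width={rms_dev[n] / np.sqrt(n):.4f}")
```

The deviation stays of order one while the radius grows like $\sqrt{n}$, so the shell becomes thinner relative to its radius as the dimension increases.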

Figures (24)

  • Figure 1: Sketch of CLIP's general geometry: image and text are embedded on linearly separable ellipsoid shells, not centered at the origin. This structure allows uncertainty to be controlled in contrastive learning: as themes become rarer (lower uncertainty), they reside farther from the modality mean vector.
  • Figure 2: Normalized histograms of certain CLIP features. Image and text are clearly drawn from different statistics. On the right it is shown that even two features are sufficient to obtain full linear separability. The results of a linear SVM classifier are shown (blue dashed line, with $100\%$ accuracy on MS-COCO).
  • Figure 3: Separability of features (left) and 10 most significant features $\ell$ for image and text, with high absolute mean, compared to the feature's standard deviation.
  • Figure 4: Statistics of image and text features after mean subtraction. Top: The first 10 features for image (top) and text (bottom). Bottom: Histograms of $\|\tilde{v}\|$ for images and text, showing a thin-shell phenomenon with no volume below a threshold, typical for high dimensions.
  • Figure 5: Normalized histograms of feature variance (left) show a long tail, indicating an ellipsoid rather than a hypersphere. Off-diagonal dominance (see the corresponding equation in the paper) suggests strong feature correlations, implying a tilted ellipsoid.
  • ...and 19 more figures

Theorems & Definitions (7)

  • Theorem 3.1: Thin shell
  • Definition 1: Conformity
  • Definition 2: Estimated conformity
  • Definition 3: Log concave distribution
  • Definition 4: Isotropic random vector
  • Proposition 1
  • Proof