Embedding Geometries of Contrastive Language-Image Pre-Training

Jason Chuan-Chih Chou, Nahid Alam

TL;DR

Variants with intuitive Euclidean geometry, Euclidean CLIP (EuCLIP), match or exceed the performance of CLIP and support hierarchical relationships at least as well as the more complicated hyperbolic alternative.

Abstract

Since the publication of CLIP, using the InfoNCE loss for contrastive pre-training has become a widely popular way to bridge two or more modalities. Despite this wide adoption, CLIP's original design choices of L2 normalization and a cosine similarity logit have rarely been revisited. We systematically experimented with alternative geometries and softmax logits for language-image pre-training and found that variants with intuitive Euclidean geometry, Euclidean CLIP (EuCLIP), match or exceed the performance of CLIP and support hierarchical relationships at least as well as the more complicated hyperbolic alternative.
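To make the design choice concrete, the following is a minimal sketch of a symmetric InfoNCE loss computed from two kinds of softmax logit: the cosine similarity logit with L2 normalization that the abstract attributes to CLIP, and a Euclidean alternative. The negative squared distance used for the Euclidean logit is an assumption made here for illustration, since the abstract does not state the exact form. All function names and the `scale` parameter are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def cosine_logits(images, texts, scale):
    """CLIP-style logit: scaled dot product of L2-normalized embeddings."""
    imgs = [l2_normalize(v) for v in images]
    txts = [l2_normalize(v) for v in texts]
    return [[scale * sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def neg_sq_dist_logits(images, texts, scale):
    """Assumed Euclidean logit: scaled negative squared distance, no normalization."""
    return [[-scale * sum((a - b) ** 2 for a, b in zip(i, t)) for t in texts]
            for i in images]

def cross_entropy_diag(logits):
    """Mean softmax cross-entropy where the target of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse - row[i]
    return total / len(logits)

def info_nce(logits):
    """Symmetric InfoNCE: average of image-to-text and text-to-image losses."""
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Toy batch of two matched image/text pairs (made-up values).
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[2.0, 0.0], [0.0, 3.0]]
loss_cosine = info_nce(cosine_logits(images, texts, scale=1.0))
loss_euclid = info_nce(neg_sq_dist_logits(images, texts, scale=1.0))
```

The two variants differ only in how the logit matrix is built; the softmax cross-entropy over matched pairs is shared, which is why the geometry can be swapped without changing the rest of the objective.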

Paper Structure (34 sections, 16 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Euclidean entailment loss in $\mathbb{R}^2$, where $\mathbf{O}$ is the origin and $K$ is the minimum radius. For $\mathbf{x}$ on the line $y = K$, its half-aperture $\text{aper}(\mathbf{x}) = \sin^{-1}(K/\lVert \mathbf{x} \rVert)$ equals the angle between the line $\mathbf{O} \mathbf{x}$ and the x-axis. The line $y = K$ therefore forms one side of the entailment cone, and by symmetry the entailment cone for $\mathbf{x} = (K, K)$ is simply a shifted quadrant. For $\mathbf{y}$ outside the entailment cone, the entailment loss is $\text{ext}(\mathbf{x}, \mathbf{y}) - \text{aper}(\mathbf{x})$, where $\text{ext}(\mathbf{x}, \mathbf{y}) = \pi - \angle \mathbf{O} \mathbf{x} \mathbf{y}$. For $\mathbf{y'}$ within the entailment cone, the entailment loss is zero.
  • Figure 2: Distribution of embedding distances for ViT-B/16 models. For EuCLIP and MERU, the distances are from the origin $\mathbf{O}$; for CLIP, the distances are from [ROOT], the average of all text and image embeddings after L2 normalization. Note that this scaled "cosine distance" lies in $[0, 1]$, yet most of the embeddings are no further than $0.5$ from the root, replicating the cone effect [ModalityGap].
  • Figure 3: Example images from the MERU repository.
  • Figure 4: Distribution of embedding distances from the origin $\mathbf{O}$ for ViT-B/32 models: EuCLIP (left) vs. MERU (right), with $\lambda = 0$ (upper) vs. $\lambda > 0$ (lower). As the upper panels show, text and image embeddings do not spontaneously separate. Such a "modality gap" [Ramasinghe2024] emerges only with entailment loss.
  • Figure 5: Distribution of number of captions retrieved.
  • ...and 3 more figures
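
The Euclidean entailment loss described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration built only from the caption's definitions of $\text{aper}$ and $\text{ext}$; the function names are hypothetical, the clamp for $\lVert \mathbf{x} \rVert < K$ is a defensive assumption not stated in the caption, and the sketch assumes $\mathbf{x} \neq \mathbf{O}$ and $\mathbf{y} \neq \mathbf{x}$.

```python
import math

def half_aperture(x, K):
    """aper(x) = asin(K / ||x||), with K the minimum radius."""
    norm = math.hypot(*x)
    # Assumption: clamp the ratio so asin stays defined when ||x|| < K.
    return math.asin(min(1.0, K / norm))

def exterior_angle(x, y):
    """ext(x, y) = pi - angle(O, x, y), the angle being measured at vertex x."""
    to_origin = [-xi for xi in x]
    to_y = [yi - xi for xi, yi in zip(x, y)]
    dot = sum(a * b for a, b in zip(to_origin, to_y))
    cos_angle = dot / (math.hypot(*to_origin) * math.hypot(*to_y))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.pi - math.acos(cos_angle)

def entailment_loss(x, y, K):
    """Zero inside the entailment cone of x, ext - aper outside it."""
    return max(0.0, exterior_angle(x, y) - half_aperture(x, K))

# For x = (K, K) with K = 1, the cone is the quadrant shifted to (1, 1).
inside = entailment_loss((1.0, 1.0), (2.0, 3.0), K=1.0)   # y' within the cone
outside = entailment_loss((1.0, 1.0), (3.0, 0.0), K=1.0)  # y below the line y = K
```

With $\mathbf{x} = (1, 1)$ and $K = 1$ the half-aperture is $\pi/4$, so any point with both coordinates at least 1 incurs no loss, matching the shifted quadrant in the caption.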