Global Minimizers of Sigmoid Contrastive Loss

Kiril Bangachev, Guy Bresler, Iliyas Noman, Yury Polyanskiy

TL;DR

This paper theoretically explains the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind, and proposes a reparameterization of the sigmoid loss with explicit relative bias.

Abstract

The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations. $(\mathsf{m}, \mathsf{b}_{\mathsf{rel}})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{b}_{\mathsf{rel}}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP and CLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.
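To make the objects in the abstract concrete, here is a minimal NumPy sketch of a sigmoid contrastive loss with trainable inverse temperature and an explicit relative bias. The function name, the sign convention `t * (<U_i, V_j> - b_rel)`, and the unnormalized sum over pairs are assumptions made for illustration and may differ from the paper's exact definition.

```python
import numpy as np

def sigmoid_contrastive_loss(U, V, t, b_rel):
    """Sigmoid loss with inverse temperature t > 0 and relative bias b_rel.

    Assumed form (not taken from the paper): logit_ij = t * (<U_i, V_j> - b_rel),
    where matching pairs (i == j) are positives and all other pairs are negatives.
    """
    logits = t * (U @ V.T - b_rel)
    labels = 2.0 * np.eye(U.shape[0]) - 1.0  # +1 on the diagonal, -1 elsewhere
    # log(1 + exp(-x)) evaluated stably via logaddexp(0, -x)
    return np.logaddexp(0.0, -labels * logits).sum()

# Toy configuration: three orthonormal pairs, so matching inner products are 1
# and non-matching inner products are 0; b_rel = 0.5 sits halfway between them.
U = np.eye(3)
V = np.eye(3)
for t in (1.0, 10.0, 100.0):
    print(t, sigmoid_contrastive_loss(U, V, t, b_rel=0.5))
```

In this toy configuration every pair lies on the correct side of `b_rel`, so the printed loss shrinks toward zero as `t` grows. This is the mechanism the abstract refers to when it says that temperature and bias can drive the loss to zero.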

Paper Structure

This paper contains 41 sections, 21 theorems, 117 equations, 21 figures, 6 tables.

Key Result

Theorem 3.1

Suppose that an iterative algorithm produces a sequence $\{U^{(s)}_i\}_{i=1}^N, \{V^{(s)}_i\}_{i=1}^N, t^{(s)}>0, b^{(s)}$ for $s = 1,2,\ldots$ such that [...]. Then there exists a subsequence indexed by $(s_r)_{r=1}^{+\infty}$ such that [...], and there exists some $\mathsf{m}\ge 0$ such that $\{(U_i,V_i)\}_{i=1}^N, \mathsf{m}, \mathsf{b}_{\mathsf{rel}}$ satisfy the $(\mathsf{m},\mathsf{b}_{\mathsf{rel}})$-Constellation condition (eq:mrconstellation).
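The constellation condition itself is not reproduced on this page. As a hedged illustration, the sketch below assumes it has the form $\langle U_i, V_i\rangle \ge \mathsf{b}_{\mathsf{rel}} + \mathsf{m}$ for matching pairs and $\langle U_i, V_j\rangle \le \mathsf{b}_{\mathsf{rel}} - \mathsf{m}$ for $i \ne j$. Under that assumption it computes the largest margin a configuration admits and checks that the loss vanishes as the inverse temperature grows. The helper names and the loss convention are hypothetical.

```python
import numpy as np

def best_margin_and_bias(U, V):
    """Return (m, b_rel) with the largest margin m under the ASSUMED condition
    <U_i, V_i> >= b_rel + m  and  <U_i, V_j> <= b_rel - m  for i != j.
    Requires at least two pairs."""
    G = U @ V.T
    mask = np.eye(G.shape[0], dtype=bool)
    lo_pos = G[mask].min()    # smallest matching-pair inner product
    hi_neg = G[~mask].max()   # largest non-matching inner product
    return (lo_pos - hi_neg) / 2.0, (lo_pos + hi_neg) / 2.0

def sigmoid_loss(U, V, t, b_rel):
    """Sigmoid loss with assumed logits t * (<U_i, V_j> - b_rel), summed over pairs."""
    labels = 2.0 * np.eye(U.shape[0]) - 1.0
    return np.logaddexp(0.0, -labels * t * (U @ V.T - b_rel)).sum()

# Toy configuration: U is the standard basis of R^3 and each V_i is a
# normalized, slightly tilted copy of U_i, so U != V but pairs stay aligned.
U = np.eye(3)
V = np.eye(3) + 0.1
V /= np.linalg.norm(V, axis=1, keepdims=True)

m, b_rel = best_margin_and_bias(U, V)
print("margin:", m, "relative bias:", b_rel)
for t in (1.0, 20.0, 200.0):
    print(t, sigmoid_loss(U, V, t, b_rel))
```

A non-negative returned margin means the configuration satisfies the assumed condition with that `b_rel`. With a strictly positive margin, every term of the loss is at most `log(1 + exp(-t * m))`, which tends to zero as `t` increases.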

Figures (21)

  • Figure 1: Distribution of inner products between image and text embeddings from the ImageNet validation set using the B/16 SigLIP model with $224\times 224$ resolution, available at https://huggingface.co/google/siglip-base-patch16-224.
  • Figure 2: Examples of zero-loss configurations for Sigmoid loss (left) and InfoNCE (right), highlighting the difference in geometries.
  • Figure 3: Region of possible $(\mathsf{m},\mathsf{b}_{\mathsf{rel}})$-Constellations. In red is the impossible region, in which no large configurations are possible (Theorem 3.4). In green is the region where constellations of exponential size exist (Theorem 3.3 and Theorem 3.5). In the shaded region we prove that a modality gap exists (Theorem 3.6).
  • Figure 4: Modality gap in SigLIP on ImageNet data with the B/16 model with $224\times 224$ resolution. We find a perfect linear separator using the perceptron algorithm.
  • Figure 5: Implicit adapter in relative bias parameterization of sigmoid loss with a locked representation. The parameters $\phi, \delta, t,b$ in green blocks are trainable. Parameter $\theta$ is locked.
  • ...and 16 more figures

Theorems & Definitions (39)

  • Theorem 3.1: All Global Minima are $(\mathsf{m},\mathsf{b}_{\mathsf{rel}})$-Constellations
  • Theorem 3.2: All $(\mathsf{m},\mathsf{b}_{\mathsf{rel}})$-Constellations Are Global Minimizers
  • Corollary 1: Nearest Neighbor Search Yields Perfect Retrieval
  • Proposition 1: Robustness of Retrieval via Nearest Neighbor Search
  • Theorem 3.3: Lower Bound on the Size of Constellations
  • proof
  • Theorem 3.4: Upper Bounds on Margin via Relative Bias
  • proof
  • Theorem 3.5: Upper Bound on the Size of Constellations
  • Theorem 3.6: Modality Gap in Zero-Loss Configurations
  • ...and 29 more
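
Corollary 1 in the list above concerns retrieval by nearest neighbor search. The sketch below illustrates the idea on a toy configuration, under the assumption that a constellation with positive margin makes every matching inner product strictly larger than every non-matching one. The function name `retrieve` and the toy data are hypothetical.

```python
import numpy as np

def retrieve(U, V):
    """For each row U_i, return the index j that maximizes <U_i, V_j>."""
    return np.argmax(U @ V.T, axis=1)

# Toy configuration: U is the standard basis of R^3 and each V_i is a
# normalized, slightly tilted copy of U_i.
U = np.eye(3)
V = np.eye(3) + 0.1
V /= np.linalg.norm(V, axis=1, keepdims=True)

predicted = retrieve(U, V)
print("retrieved indices:", predicted)
print("perfect retrieval:", bool(np.array_equal(predicted, np.arange(3))))
```

Because all embeddings here are unit vectors, maximizing the inner product is the same as minimizing the Euclidean distance, so this is nearest neighbor search in the usual sense.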