
The Hidden Pitfalls of the Cosine Similarity Loss

Andrew Draganov, Sharvaree Vadgama, Erik J. Bekkers

TL;DR

The paper reveals that the cosine-similarity loss used in self-supervised learning can produce vanishing gradients in regimes of large embedding norms or antipodal positive pairs, and, counterintuitively, optimizing this loss drives embeddings to grow in magnitude, creating a convergence catch-22. It derives general, architecture-agnostic gradient expressions and shows these dynamics extend to InfoNCE losses, supported by empirical evidence of embedding-norm growth and slower training without normalization. To mitigate the issue, it introduces cut-initialization, a simple pretraining weight-scaling technique, recommending $c=3$ for contrastive and $c=9$ for non-contrastive methods, which, together with $\ell_2$-normalization, accelerates convergence across multiple SSL architectures and datasets. The work also analyzes the opposite-halves effect and concludes it has limited practical impact, reinforcing the practical value of proper normalization and initialization strategies in SSL.
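Since the TL;DR describes cut-initialization only as a weight-scaling applied at initialization, the sketch below shows one plausible way such a rescaling could look in PyTorch. Dividing every layer's initial weights by a single constant $c$, and the helper name cut_init_, are assumptions for illustration rather than the paper's reference implementation.

```python
# Hypothetical sketch of a cut-initialization-style rescaling (assumption:
# every Linear/Conv2d layer's freshly initialized weights are divided by c).
import torch
import torch.nn as nn

def cut_init_(model: nn.Module, c: float) -> None:
    """Divide the initial weights of each Linear/Conv2d layer by the constant c."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                m.weight.div_(c)

# Usage: apply once, right after building the encoder and before SSL pretraining.
encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
cut_init_(encoder, c=3.0)  # TL;DR recommends c=3 for contrastive, c=9 for non-contrastive methods
```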

Abstract

We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
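The abstract's claim that (1) is unavoidable follows from a one-step gradient-descent argument (a sketch, assuming plain gradient descent with learning rate $\eta$): the gradient of the cosine-similarity loss is orthogonal to $z_i$ (Proposition 1 below), so each update $z_i' = z_i - \eta\,\nabla_i^\mathcal{A}$ with $\nabla_i^\mathcal{A} \perp z_i$ satisfies $||z_i'||^2 = ||z_i||^2 + \eta^2\,||\nabla_i^\mathcal{A}||^2 \geq ||z_i||^2$. The embedding can therefore only grow, while the gradient magnitude $\sin(\phi_{ij})/||z_i||$ shrinks as training proceeds.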


Paper Structure

This paper contains 15 sections, 5 theorems, 18 equations, 5 figures, and 6 tables.

Key Result

Proposition 1

Let $z_i$ and $z_j$ be two points in $\mathbb{R}^d$ and define $\mathcal{L}_i^\mathcal{A}(\mathbf{Z}) = -\hat{z}_i^\top \hat{z}_j$. Let $\phi_{ij}$ be the angle between $z_i$ and $z_j$. Then the gradient of $\mathcal{L}_i^\mathcal{A}(\mathbf{Z})$ w.r.t. $z_i$ is $\nabla_i^\mathcal{A} = -\frac{\hat{z}_{j \perp z_i}}{||z_i||}$, where $a_{\perp b}$ is the component of $a$ orthogonal to $b$. This has magnitude $||\nabla_i^\mathcal{A}|| = \frac{\sin(\phi_{ij})}{||z_i||}$.
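A quick finite-difference check of Proposition 1 (an illustrative NumPy sketch, not the authors' code): the numerically estimated gradient norm matches $\sin(\phi_{ij})/||z_i||$, so it vanishes both when $||z_i||$ is large and when the positive pair is antipodal ($\phi_{ij} \to \pi$).

```python
# Check that ||grad_{z_i} (-cos_sim(z_i, z_j))|| equals sin(phi_ij) / ||z_i||.
import numpy as np

def loss(zi, zj):
    # Negative cosine similarity, i.e. the alignment loss L_i^A from Proposition 1.
    return -np.dot(zi, zj) / (np.linalg.norm(zi) * np.linalg.norm(zj))

def grad_norm(zi, zj, eps=1e-6):
    # Central finite differences w.r.t. z_i.
    g = np.empty_like(zi)
    for k in range(zi.size):
        e = np.zeros_like(zi)
        e[k] = eps
        g[k] = (loss(zi + e, zj) - loss(zi - e, zj)) / (2 * eps)
    return np.linalg.norm(g)

rng = np.random.default_rng(0)
zj = rng.normal(size=8)
for scale in (1.0, 10.0, 100.0):  # larger ||z_i|| -> smaller gradient
    zi = scale * rng.normal(size=8)
    cos = np.dot(zi, zj) / (np.linalg.norm(zi) * np.linalg.norm(zj))
    phi = np.arccos(np.clip(cos, -1.0, 1.0))
    print(f"||z_i||={np.linalg.norm(zi):8.2f}  numeric={grad_norm(zi, zj):.6f}  "
          f"analytic={np.sin(phi) / np.linalg.norm(zi):.6f}")

# Antipodal positive pair: phi_ij = pi, so the gradient also vanishes.
print("antipodal:", grad_norm(-2.0 * zj, zj))  # ~0
```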

Figures (5)

  • Figure 1: Left: The gradients w.r.t. $z_i$ in Proposition \ref{prop:cos_sim_grads} and Corollary \ref{cor:infonce_grads} exclusively exist in the tangent space at $\vec{z}_i$. Right: The growing embeddings in Corollary \ref{cor:embeddings_grow}. Blue points represent $z_i$ at iterations $t = 1, 2, 3$. Yellow points represent $z_i'$, i.e. the result of each step of gradient descent.
  • Figure 2: The effect of the embedding norm and angle between positive samples on the convergence rate.
  • Figure 3: Left: The embedding norms when optimizing the subsets of the InfoNCE loss function for SimCLR. Right: The mean embedding norms for SimCLR/SimSiam/BYOL as a function of the weight-decay. Note that we use the terms $\ell_2$-normalization and weight-decay interchangeably.
  • Figure 4: Lines between positive samples in 2D latent space during training. The color goes from red to blue as the cos. sim. goes from $-1$ to $1$.
  • Figure 5: The effect of cut-initialization on CIFAR-10 SSL representations. The $x$-axis and the embedding norm's $y$-axis are log-scale. $\lambda = 5 \times 10^{-4}$ in all experiments.

Theorems & Definitions (10)

  • Proposition 1: Prop. 3 in spherical_embeddings; proof in \ref{prf:prop_grad_grows}
  • Corollary 1: proof in \ref{prf:cor_embeddings_grow}
  • Theorem 1: proof in \ref{prf:thm_convergence_rate}
  • Proof
  • Proof
  • Proof
  • Corollary 2
  • Proof
  • Proposition 2
  • Proof