The Hidden Pitfalls of the Cosine Similarity Loss
Andrew Draganov, Sharvaree Vadgama, Erik J. Bekkers
TL;DR
The paper shows that the cosine-similarity loss used in self-supervised learning produces vanishing gradients in two regimes: when embedding norms grow large and when positive pairs are antipodal. Counterintuitively, optimizing this loss also drives embedding magnitudes to grow, creating a convergence catch-22. The authors derive general, architecture-agnostic gradient expressions and show that these dynamics extend to InfoNCE losses, supported by empirical evidence of embedding-norm growth and slower training without normalization. To mitigate the issue, they introduce cut-initialization, a simple pretraining weight-scaling technique (with $c=3$ recommended for contrastive and $c=9$ for non-contrastive methods) which, together with $\ell_2$-normalization, accelerates convergence across multiple SSL architectures and datasets; a sketch of the idea appears below. The work also analyzes the opposite-halves effect and concludes it has limited practical impact, reinforcing the practical value of proper normalization and initialization strategies in SSL.
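As a rough illustration of the idea, here is a minimal sketch assuming cut-initialization simply divides each layer's freshly initialized weights by a constant $c$ before pretraining; the helper name `cut_init` and the choice of which modules to rescale are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def cut_init(model: nn.Module, c: float = 3.0) -> nn.Module:
    """Illustrative cut-initialization: divide each layer's initial
    weights (and biases) by a constant c, shrinking initial embedding
    norms before self-supervised pretraining.

    c=3 is the TL;DR's suggestion for contrastive methods and c=9 for
    non-contrastive ones; the set of rescaled layers is an assumption.
    """
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, (nn.Linear, nn.Conv2d)):
                module.weight.div_(c)
                if module.bias is not None:
                    module.bias.div_(c)
    return model

# Usage: rescale a freshly initialized backbone, then run SSL pretraining as usual.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
encoder = cut_init(encoder, c=3.0)
```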
Abstract
We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
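To make the two vanishing-gradient settings concrete, a standard calculus identity (written here in generic notation, not the paper's) gives the gradient of the cosine similarity with respect to one point:

$$
\nabla_{x}\,\frac{x^{\top} y}{\lVert x\rVert\,\lVert y\rVert}
  \;=\; \frac{1}{\lVert x\rVert\,\lVert y\rVert}
  \left( y - \frac{x^{\top} y}{\lVert x\rVert^{2}}\, x \right).
$$

The prefactor $1/(\lVert x\rVert\,\lVert y\rVert)$ shrinks as $\lVert x\rVert$ grows, which is setting (1); and if $y = -\alpha x$ for some $\alpha > 0$ (the points sit on opposite ends of the latent space), the bracketed term equals $-\alpha x + \alpha x = 0$, which is setting (2).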
