In Defense of Cosine Similarity: Normalization Eliminates the Gauge Freedom

Taha Bouhsine

TL;DR

When embeddings are constrained to the unit sphere $\mathbb{S}^{d-1}$ (either during or after training with an appropriate objective), the $D$-matrix ambiguity vanishes identically, and cosine distance equals exactly half the squared Euclidean distance.

Abstract

Steck, Ekanadham, and Kallus [arXiv:2403.05440] demonstrate that cosine similarity of learned embeddings from matrix factorization models can be rendered arbitrary by a diagonal "gauge" matrix $D$. Their result is correct and important for practitioners who compute cosine similarity on embeddings trained with dot-product objectives. However, we argue that their conclusion, which cautions against cosine similarity in general, conflates the pathology of an incompatible training objective with the geometric validity of cosine distance on the unit sphere. We prove that when embeddings are constrained to the unit sphere $\mathbb{S}^{d-1}$ (either during or after training with an appropriate objective), the $D$-matrix ambiguity vanishes identically, and cosine distance reduces to exactly half the squared Euclidean distance. This monotonic equivalence implies that cosine-based and Euclidean-based neighbor rankings are identical on normalized embeddings. The "problem" with cosine similarity is not cosine similarity itself; it is the failure to normalize.

Paper Structure (19 sections, 6 theorems, 22 equations, 5 figures, 1 table)


Key Result

Theorem 4

For any $\mathbf{x}, \mathbf{y} \in \mathbb{S}^{d-1}$: $d_C(\mathbf{x}, \mathbf{y}) = \frac{1}{2} d_E^2(\mathbf{x}, \mathbf{y})$.
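
The identity $d_C = \frac{1}{2}d_E^2$ can be checked by expanding the squared norm. The sketch below is not the paper's proof text; it assumes Definition 1 gives $d_C(\mathbf{x}, \mathbf{y}) = 1 - \cos\theta$ (as in the Figure 5 caption) and Definition 2 gives $d_E^2(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2$, and it uses $\|\mathbf{x}\| = \|\mathbf{y}\| = 1$, so that $\langle \mathbf{x}, \mathbf{y} \rangle = \cos\theta$.

```latex
\begin{align*}
d_E^2(\mathbf{x}, \mathbf{y})
  &= \|\mathbf{x} - \mathbf{y}\|^2 \\
  &= \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - 2\langle \mathbf{x}, \mathbf{y} \rangle \\
  &= 2 - 2\cos\theta \\
  &= 2\, d_C(\mathbf{x}, \mathbf{y}).
\end{align*}
```

Dividing by two gives the stated identity. The third line is the only step that uses the sphere constraint, which is why the equivalence fails for unnormalized embeddings.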

Figures (5)

  • Figure 1: Two unit vectors $\mathbf{x}$ and $\mathbf{y}$ on $\mathbb{S}^1$. The gold arc is the geodesic (angular distance $\theta$), and the blue dashed line is the Euclidean chord ($d_E$). The shaded wedge shows the cosine distance region. By Theorem 4, $d_C = \frac{1}{2}d_E^2$, so the two distances rank all pairs identically.
  • Figure 2: The equivalence curve. $d_C$ (dashed orange) and $\frac{1}{2}d_E^2$ (solid blue) are identical functions of $\theta$ when both vectors are on $\mathbb{S}^{d-1}$. The curves overlap perfectly: they describe the same geometric quantity.
  • Figure 3: The effect of the gauge matrix $D$ on two embeddings. Left: $D = I$, no distortion. Center: $D = \mathrm{diag}(2, 0.5)$ stretches the first axis and compresses the second; the cosine similarity increases. Right: $D = \mathrm{diag}(0.3, 3.3)$ compresses the first axis severely; both vectors become nearly aligned, giving cosine $\approx 1$. In all three cases, $\langle \mathbf{b}_1, \mathbf{b}_2 \rangle$ is identical, so the model's predictions are unchanged.
  • Figure 4: Two workflows for using cosine similarity. Path A (the pathology): train with an unconstrained dot-product objective, then normalize post-hoc. The $D$-ambiguity has already been baked in. Path B (the solution): train with an explicit sphere constraint, so the $D$-freedom is never available to the optimizer. Cosine distance is then exactly $\frac{1}{2}d_E^2$.
  • Figure 5: On the sphere $\mathbb{S}^{d-1}$, the geodesic distance (arc length) and the Euclidean chord distance both yield the same nearest-neighbor ranking. The cosine distance $d_C = 1 - \cos\theta$ is a monotone function of both, providing a third equivalent metric for ranking purposes.
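
The ranking equivalence described in Figures 1, 2, and 5 can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper names (`normalize_rows`, `cosine_distance`, `squared_euclidean`), the random data, and the dimensions are all illustrative assumptions.

```python
import numpy as np


def normalize_rows(X):
    """Project each row of X onto the unit sphere."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def cosine_distance(X, q):
    """d_C = 1 - cos(theta) between each row of X and the query q."""
    return 1.0 - (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))


def squared_euclidean(X, q):
    """d_E^2 = ||x - q||^2 for each row x of X."""
    return np.sum((X - q) ** 2, axis=1)


# Hypothetical data: 1000 random embeddings and one query, all normalized.
rng = np.random.default_rng(0)
X = normalize_rows(rng.normal(size=(1000, 16)))
q = normalize_rows(rng.normal(size=(1, 16)))[0]

d_c = cosine_distance(X, q)
d_e2 = squared_euclidean(X, q)

# On the sphere, cosine distance is half the squared Euclidean distance.
identity_holds = np.allclose(d_c, 0.5 * d_e2)

# Because the map is monotone, both distances sort the neighbors identically.
same_ranking = np.array_equal(
    np.argsort(d_c, kind="stable"), np.argsort(d_e2, kind="stable")
)

print(identity_holds, same_ranking)
```

Skipping the `normalize_rows` calls is a quick way to see the identity break: without unit norms, the two distances are no longer related by a fixed monotone map.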

Theorems & Definitions (14)

  • Definition 1: Cosine Similarity and Distance
  • Definition 2: Squared Euclidean Distance
  • Definition 3: Gauge Matrix
  • Theorem 4: Cosine-Euclidean Equivalence
  • proof
  • Corollary 5: Monotonic Ranking Equivalence
  • proof
  • Proposition 6: Gauge Freedom, Steck et al. 2024
  • Theorem 7: Normalization Kills the Gauge
  • proof
  • ...and 4 more
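
The gauge freedom named in Proposition 6 can be illustrated with a small numerical sketch. It assumes the standard matrix factorization setting of Steck et al., in which predictions are $AB^\top$ and the rescaling $A \to AD$, $B \to BD^{-1}$ leaves them unchanged. The matrices, sizes, and variable names are hypothetical, and the final check is only an illustration of the intuition behind the sphere constraint, not the paper's proof of Theorem 7.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))  # hypothetical user embeddings
B = rng.normal(size=(4, 2))  # hypothetical item embeddings

# A diagonal gauge matrix, matching the center panel of Figure 3.
D = np.diag([2.0, 0.5])
D_inv = np.linalg.inv(D)

A_g = A @ D      # rescaled user embeddings
B_g = B @ D_inv  # inversely rescaled item embeddings

# The dot-product predictions are invariant: (A D)(B D^{-1})^T = A B^T.
predictions_unchanged = np.allclose(A @ B.T, A_g @ B_g.T)

# The cosine similarity between two item embeddings is not invariant.
cos_before = cosine_similarity(B[0], B[1])
cos_after = cosine_similarity(B_g[0], B_g[1])
cosine_changed = not np.isclose(cos_before, cos_after)

# If the item embeddings are required to have unit norm, this rescaling
# pushes them off the sphere, so it is no longer an admissible solution.
B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
norms_after_gauge = np.linalg.norm(B_unit @ D_inv, axis=1)
constraint_broken = not np.allclose(norms_after_gauge, 1.0)

print(predictions_unchanged, cosine_changed, constraint_broken)
```

The first two checks reproduce the ambiguity that Steck et al. describe. The third shows why a unit-norm constraint is incompatible with a non-trivial $D$ in this example.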