Surpassing Cosine Similarity for Multidimensional Comparisons: Dimension Insensitive Euclidean Metric

Federico Tessari; Kunpeng Yao; Neville Hogan

TL;DR

The paper addresses the dimension-dependent bias that cosine similarity exhibits when comparing high-dimensional vectors. It analyzes how dimensionality affects standard metrics and introduces the Dimension Insensitive Euclidean Metric (DIEM), defined as $\mathrm{DIEM}=\frac{v_M-v_m}{\sigma_{ed}^2}\big(\sqrt{\sum_{i=1}^n (a_i-b_i)^2}-\mathbb{E}[d(n)]\big)$, where $v_M$ and $v_m$ are the maximum and minimum values the vector elements can take, $\sigma_{ed}^2$ is the variance of the Euclidean distance, and $\mathbb{E}[d(n)]$ is its expected value for vectors of dimension $n$. Subtracting the expected distance detrends the metric, and the scaling stabilizes its variance across dimensions. The authors derive analytical properties of the Euclidean distance, show that $\mathbb{E}[\mathrm{DIEM}]=0$ and that its variance is dimension-invariant for $n\ge 7$, and validate DIEM on a large-language-model text-embedding case study, where it outperforms cosine similarity in high dimensions. They present DIEM as a potentially general-purpose tool for multidimensional comparisons in NLP, ML, and computational neuroscience, while noting limitations for normalized data and encouraging further validation.
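The DIEM expression above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names `euclidean_stats` and `diem` are hypothetical, and estimating $\mathbb{E}[d(n)]$ and $\sigma_{ed}^2$ by Monte Carlo from uniformly distributed vectors in $[v_m, v_M]$ is an assumption made here for simplicity.

```python
import numpy as np


def euclidean_stats(n, v_min, v_max, samples=10000, seed=0):
    """Estimate E[d(n)] and the variance of the Euclidean distance.

    Assumption: vector elements are i.i.d. uniform in [v_min, v_max].
    The statistics are estimated by simulation, not analytically.
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(v_min, v_max, size=(samples, n))
    b = rng.uniform(v_min, v_max, size=(samples, n))
    d = np.linalg.norm(a - b, axis=1)  # one distance per random pair
    return d.mean(), d.var()


def diem(a, b, v_min, v_max, expected_d, var_d):
    """DIEM = (v_M - v_m) / sigma_ed^2 * (||a - b|| - E[d(n)])."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = np.linalg.norm(a - b)  # plain Euclidean distance
    return (v_max - v_min) / var_d * (d - expected_d)


if __name__ == "__main__":
    n, v_min, v_max = 50, 0.0, 1.0
    expected_d, var_d = euclidean_stats(n, v_min, v_max)
    rng = np.random.default_rng(1)
    a = rng.uniform(v_min, v_max, size=n)
    b = rng.uniform(v_min, v_max, size=n)
    print("DIEM(a, b):", diem(a, b, v_min, v_max, expected_d, var_d))
    # Identical vectors have zero distance, so DIEM is negative.
    print("DIEM(a, a):", diem(a, a, v_min, v_max, expected_d, var_d))
```

Under these assumptions, values near zero indicate a pair that is about as far apart as two random vectors of the same dimension, while strongly negative values indicate vectors that are closer than chance.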

Abstract

Advances in computational power and hardware efficiency have enabled tackling increasingly complex, high-dimensional problems. While artificial intelligence (AI) achieves remarkable results, the interpretability of high-dimensional solutions remains challenging. A critical issue is the comparison of multidimensional quantities, essential in techniques like Principal Component Analysis. Metrics such as cosine similarity are often used, for example in the development of natural language processing algorithms or recommender systems. However, the interpretability of such metrics diminishes as dimensions increase. This paper analyzes the effects of dimensionality, revealing significant limitations of cosine similarity, particularly its dependency on the dimension of vectors, leading to biased and poorly interpretable outcomes. To address this, we introduce a Dimension Insensitive Euclidean Metric (DIEM) which demonstrates superior robustness and generalizability across dimensions. DIEM maintains consistent variability and eliminates the biases observed in traditional metrics, making it a reliable tool for high-dimensional comparisons. An example of the advantages of DIEM over cosine similarity is reported for a large language model application. This novel metric has the potential to replace cosine similarity, providing a more accurate and insightful method to analyze multidimensional data in fields ranging from neuromotor control to machine learning.
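The dimension dependence of cosine similarity described in the abstract can be observed with a small simulation. The sketch below assumes vector elements drawn independently from a uniform distribution, which is an illustrative choice; the helper name `cosine_similarity_samples` is hypothetical.

```python
import numpy as np


def cosine_similarity_samples(n, low, high, samples=2000, seed=0):
    """Cosine similarity of random vector pairs of dimension n.

    Assumption: elements are i.i.d. uniform in [low, high].
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(low, high, size=(samples, n))
    b = rng.uniform(low, high, size=(samples, n))
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dot / norms


if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        pos = cosine_similarity_samples(n, 0.0, 1.0)    # positive-only elements
        real = cosine_similarity_samples(n, -1.0, 1.0)  # elements of either sign
        print(f"n={n:5d}  positive: mean={pos.mean():.3f} std={pos.std():.3f}"
              f"  all-real: mean={real.mean():.3f} std={real.std():.3f}")
```

For positive-only uniform elements in $[0, 1]$, the law of large numbers gives a limit of $\mathbb{E}[ab]/\mathbb{E}[a^2] = (1/4)/(1/3) = 0.75$ as the dimension grows, and the spread shrinks. Two unrelated random vectors therefore look increasingly "similar" in high dimensions, which is the kind of bias the abstract refers to.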

Paper Structure (16 sections, 55 equations, 12 figures)

Introduction
Cosine Similarity and Euclidean Distance
Effect of Vector Dimensionality
Mathematical Properties of the Euclidean Distance
A Dimension-Insensitive Metric for Multidimensional Comparison
A Case Study: LLMs Text Embeddings
Discussion
Appendix
Cosine Similarity and Euclidean Distance
Signed Cosine Similarity
Effect of Vectors' Distribution
Transition from Non-normal to Normal Distribution
Effect of Dimensions on Manhattan Distance
Proof of Convergence of Cosine Similarity
All Real Case
...and 1 more sections

Figures (12)

Figure 1: Algorithm used for the sensitivity analysis of cosine similarity with respect to vector dimension and domain.
Figure 2: Panel (a): Cosine similarity boxplots for increasing dimension of the vectors $\mathbf{a}$ and $\mathbf{b}$. Panel (b): Normalized Euclidean distance boxplots for increasing dimension of the vectors $\mathbf{a}$ and $\mathbf{b}$. The three sub-panels show, respectively, the case in which vector elements were only positive (left), only negative (center) or could assume all real values within the given range (right).
Figure 3: Panel (a): Euclidean distance for increasing dimension of the vectors $\mathbf{a}$ and $\mathbf{b}$. The three sub-panels show, respectively, the case in which vector elements were only positive (left), only negative (center) or could assume all real values within the given range (right). Panel (b): Euclidean distance for increasing dimension of the vectors $\mathbf{a}$ and $\mathbf{b}$. The blue circles show the minimum, maximum and expected analytical Euclidean distance values.
Figure 4: Panel (a): Histograms of the non-normalized Euclidean distance for growing dimension $n$. Panel (b): Histograms of the detrended Euclidean distance for growing dimension $n$.
Figure 5: Dimension Insensitive Euclidean Metric (DIEM) for increasing dimension of the vectors $\mathbf{a}$ and $\mathbf{b}$. The three panels show, respectively, the case in which vector elements were only positive (left), only negative (center) or could assume all real values within the given range (right).
...and 7 more figures
