Exploring the Meaningfulness of Nearest Neighbor Search in High-Dimensional Space

Zhonghan Chen, Ruiyuan Zhang, Xi Zhao, Xiaojun Cheng, Xiaofang Zhou

TL;DR

The paper asks whether nearest neighbor search (NNS) remains meaningful for high‑dimensional embeddings. It quantifies meaningfulness with two measures and compares them across data modalities: relative contrast (RC), $C_r = \frac{D_{mean}}{D_{min}}$, the ratio of the mean distance from a query to the data over the distance to its nearest neighbor, and local intrinsic dimensionality, $LID_X(x) = \frac{x f_X(x)}{F_X(x)}$, where $F_X$ is the cumulative distribution of distances and $f_X$ its density. Experiments cover three distance functions ($L_1$, $L_2$, angular) and dimensionalities up to the thousands, on six real‑world text datasets, two image datasets, and synthesized random vectors; text embeddings come from all‑MiniLM‑L6‑V2 and bert-base-nli-mean-tokens, and image embeddings from CLIP. The key finding is that random vectors rapidly lose meaningful NNS as dimensionality grows (RC approaches $1$, i.e., the nearest neighbor is barely closer than an average point), while real‑world text embeddings keep NNS meaningful even at very high dimensionality (RC stays well above $1$, e.g., $1.75$–$2.05$). The choice of distance function has only a marginal effect on NNS meaningfulness, which points to the robustness of embedding‑based representations for retrieval tasks such as RAG.
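The behavior of relative contrast on random vectors can be reproduced in a few lines. The sketch below is an illustration only, not the paper's experimental code: the helper names (`relative_contrast`, `random_rc`), the uniform sampling in the unit cube, the sample sizes, and the choice to average per-query ratios are all assumptions made here.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_contrast(query, data, dist=l2):
    """C_r = D_mean / D_min for one query against a list of data points."""
    dists = [dist(query, p) for p in data]
    d_mean = sum(dists) / len(dists)
    d_min = min(dists)
    return d_mean / d_min


def random_rc(dim, n_points=300, n_queries=10, seed=0):
    """Average relative contrast over queries for uniform random vectors.

    Assumption: data and queries are drawn uniformly from [0, 1)^dim.
    """
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    queries = [[rng.random() for _ in range(dim)] for _ in range(n_queries)]
    return sum(relative_contrast(q, data) for q in queries) / n_queries


if __name__ == "__main__":
    # RC is expected to shrink toward 1 as the dimension grows.
    for dim in (2, 16, 128, 512):
        print(dim, round(random_rc(dim), 3))
```

Since the mean distance can never be smaller than the minimum distance, $C_r \geq 1$ by construction; values close to $1$ mean the nearest neighbor is hardly distinguishable from the rest of the data.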

Abstract

Dense high-dimensional vectors are becoming increasingly vital in fields such as computer vision, machine learning, and large language models (LLMs), serving as standard representations for multimodal data. The dimensionality of these vectors can now easily exceed several thousand. Although nearest neighbor search (NNS) over these dense high-dimensional vectors has been widely used for retrieval augmented generation (RAG) and many other applications, the effectiveness of NNS in such high-dimensional spaces remains uncertain, given the possible challenge posed by the "curse of dimensionality." To address this question, we conduct extensive NNS studies with different distance functions, such as $L_1$ distance, $L_2$ distance, and angular distance, across diverse embedding datasets of varied types, dimensionalities, and modalities. Our aim is to investigate the factors that influence the meaningfulness of NNS. Our experiments reveal that, compared to random vectors, high-dimensional text embeddings are more resilient as dimensionality rises. This resilience suggests that text embeddings are less affected by the "curse of dimensionality," resulting in more meaningful NNS outcomes for practical use. Additionally, the choice of distance function has minimal impact on the relevance of NNS. Our study shows the effectiveness of the embedding-based data representation method and offers opportunities for further optimization of dense vector-related applications.
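For reference, the three distance functions named in the abstract can be written as follows. This is a minimal sketch under stated assumptions: angular distance is taken here to be the arccosine of the cosine similarity (the paper may normalize it differently), vectors are assumed to be nonzero, and the function names and toy data are illustrative only.

```python
import math


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angular_distance(a, b):
    """Angle in radians between two nonzero vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos)


def nearest_index(query, data, dist):
    """Index of the data point closest to the query under `dist`."""
    return min(range(len(data)), key=lambda i: dist(query, data[i]))


if __name__ == "__main__":
    data = [[1.0, 0.1], [0.2, 1.0], [3.0, 3.2]]  # toy vectors
    query = [0.9, 0.2]
    for name, dist in (("L1", l1_distance),
                       ("L2", l2_distance),
                       ("angular", angular_distance)):
        print(name, nearest_index(query, data, dist))
```

Passing the distance as a parameter makes it easy to run the same search under each function and compare the neighbors returned.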

Paper Structure

This paper contains 15 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Compare the homogeneity of RC and LID
  • Figure 2: Example: Top-$5$ similar texts of query text retrieved by the NNS
  • Figure 3: Example: Similar images of query image retrieved by the NNS
  • Figure 4: Explore the impact of distance function on the Relative Contrast (Upper: sort by dataset || Lower: sort by function)
  • Figure 5: Explore the impact of dimensionality on high-dimensional random vector
  • ...and 3 more figures

Theorems & Definitions (3)

  • Definition: Relative Contrast [rc-paper]
  • Definition: Local Intrinsic Dimensionality [lid-paper]
  • Definition: $k$ Nearest Neighbor Search
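
The last two definitions can be connected in a short sketch: a brute-force $k$ nearest neighbor search, followed by an estimate of local intrinsic dimensionality from the resulting neighbor distances. The estimator used here is the common maximum-likelihood form, $-\left(\frac{1}{k}\sum_{i=1}^{k}\ln\frac{r_i}{r_k}\right)^{-1}$, where $r_i$ is the distance to the $i$-th neighbor; whether the paper uses this estimator is an assumption, and the function names and test data are illustrative only.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, data, k, dist=l2):
    """Brute-force k-NN: the k (distance, index) pairs nearest to the query."""
    scored = sorted((dist(query, p), i) for i, p in enumerate(data))
    return scored[:k]


def lid_mle(knn_distances):
    """Maximum-likelihood LID estimate from ascending k-NN distances.

    Zero distances (exact duplicates of the query) are skipped because
    their logarithm is undefined.
    """
    r_k = knn_distances[-1]
    logs = [math.log(r / r_k) for r in knn_distances if r > 0]
    mean_log = sum(logs) / len(logs)
    if mean_log == 0:
        raise ValueError("all neighbor distances are equal; LID is undefined")
    return -1.0 / mean_log


if __name__ == "__main__":
    rng = random.Random(0)
    # Points uniform in the unit square, so the estimate should be near 2.
    data = [[rng.random(), rng.random()] for _ in range(5000)]
    neighbors = knn([0.5, 0.5], data, k=100)
    print(lid_mle([d for d, _ in neighbors]))
```

As a sanity check on the definition itself: if distances follow $F_X(x) \propto x^m$, as for points spread uniformly in $m$ dimensions, then $LID_X(x) = \frac{x \cdot m x^{m-1}}{x^m} = m$, so the measure recovers the dimension of the space the data actually occupies.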