Exploring the Meaningfulness of Nearest Neighbor Search in High-Dimensional Space
Zhonghan Chen, Ruiyuan Zhang, Xi Zhao, Xiaojun Cheng, Xiaofang Zhou
TL;DR
To assess whether nearest neighbor search (NNS) remains meaningful in high‑dimensional embeddings, the paper uses relative contrast (RC), $C_r = \frac{D_{mean}}{D_{min}}$, and local intrinsic dimensionality (LID), $LID_X(x) = \frac{x f_X(x)}{F_X(x)}$, to quantify intrinsic meaningfulness and compare it across data modalities. It experiments with multiple distance metrics ($L_1$, $L_2$, angular) and dimensionalities of up to several thousand on six real‑world text datasets, two image datasets, and synthesized random vectors, using text embeddings from all‑MiniLM‑L6‑V2 and bert-base-nli-mean-tokens and image embeddings from CLIP. The key finding is that random vectors rapidly lose meaningful NNS as dimensionality grows (RC approaches $1$), while real‑world text embeddings maintain meaningful NNS even at very high dimensionality (RC stays well above $1$, e.g., $1.75$–$2.05$). Moreover, the choice of distance function has only a marginal effect on NNS meaningfulness, highlighting the robustness of embedding‑based representations for retrieval tasks such as RAG.
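As a worked illustration of the LID formula above (not taken from the paper), assume $X$ is the distance from the center of a ball of radius $w$ in $\mathbb{R}^m$ to a point drawn uniformly from that ball; the symbols $m$ and $w$ are introduced here only for this example. Under that assumption the formula recovers the ambient dimension:

$$
F_X(x) = \left(\frac{x}{w}\right)^{m}, \qquad
f_X(x) = \frac{m\,x^{m-1}}{w^{m}}, \qquad
LID_X(x) = \frac{x \cdot \frac{m\,x^{m-1}}{w^{m}}}{\left(\frac{x}{w}\right)^{m}} = m, \qquad 0 < x \le w.
$$

Data whose distance distribution grows more slowly than $x^m$ near a query would, by the same formula, have an LID below the ambient dimension $m$.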
Abstract
Dense high-dimensional vectors are becoming increasingly vital in fields such as computer vision, machine learning, and large language models (LLMs), serving as standard representations for multimodal data. The dimensionality of these vectors can now easily exceed several thousand. Although nearest neighbor search (NNS) over these dense high-dimensional vectors has been widely used for retrieval-augmented generation (RAG) and many other applications, the effectiveness of NNS in such high-dimensional spaces remains uncertain, given the possible challenges posed by the "curse of dimensionality." To address this question, we conduct extensive NNS studies with different distance functions, such as $L_1$ distance, $L_2$ distance, and angular distance, across diverse embedding datasets of varied types, dimensionalities, and modalities. Our aim is to investigate the factors that influence the meaningfulness of NNS. Our experiments reveal that high-dimensional text embeddings are more resilient than random vectors as dimensionality rises. This resilience suggests that text embeddings are less affected by the "curse of dimensionality," resulting in more meaningful NNS outcomes in practice. Additionally, the choice of distance function has minimal impact on the meaningfulness of NNS. Our study demonstrates the effectiveness of embedding-based data representation and offers opportunities for further optimization of dense-vector applications.
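The relative contrast $C_r = \frac{D_{mean}}{D_{min}}$ from the TL;DR can be estimated on synthetic data with a few lines of NumPy. The sketch below is illustrative only and is not the authors' code: the function name `relative_contrast`, the uniform random data, the sample sizes, and the choice to average the ratio over a handful of queries are all assumptions made for this example.

```python
import numpy as np

def relative_contrast(data, queries, metric="l2"):
    """Estimate D_mean / D_min, averaged over the query points (an assumption)."""
    ratios = []
    for q in queries:
        if metric == "l1":
            d = np.abs(data - q).sum(axis=1)
        elif metric == "l2":
            d = np.linalg.norm(data - q, axis=1)
        elif metric == "angular":
            # Angle between vectors; clip guards against rounding outside [-1, 1].
            cos = data @ q / (np.linalg.norm(data, axis=1) * np.linalg.norm(q))
            d = np.arccos(np.clip(cos, -1.0, 1.0))
        else:
            raise ValueError(f"unknown metric: {metric}")
        ratios.append(d.mean() / d.min())
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
results = {}
for dim in (2, 32, 512):
    # Hypothetical setup: 2000 uniform random points and 20 random queries.
    data = rng.random((2000, dim))
    queries = rng.random((20, dim))
    results[dim] = {
        m: relative_contrast(data, queries, m) for m in ("l1", "l2", "angular")
    }
    print(dim, {m: round(v, 3) for m, v in results[dim].items()})
```

On random vectors like these, the estimate should shrink toward $1$ as `dim` grows for all three metrics, which is the behavior the abstract contrasts with real text embeddings. Swapping `data` and `queries` for actual embedding matrices would give the corresponding estimate for a real dataset.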
