Approximate Diverse $k$-nearest Neighbor Search in Vector Database

Jiachen Zhao, Xiao Yan, Eric Lo

TL;DR

This work integrates result diversification into state-of-the-art (SOTA) A$k$-NNS methods through a progressive search framework that iterates over searching, diversification, and verification phases.

Abstract

Approximate $k$-nearest neighbor search (A$k$-NNS) is a core operation in vector databases, underpinning applications such as retrieval-augmented generation (RAG) and image retrieval. In these scenarios, users often prefer diverse result sets that minimize redundancy and increase information value. However, existing greedy diversification methods frequently yield sub-optimal results, failing to adequately approximate the optimal similarity score under a given diversification level. Furthermore, there is a need for flexible algorithms that can adapt to varying user-defined result sizes and diversity requirements. To address these challenges, we propose an approach that integrates result diversification into state-of-the-art (SOTA) A$k$-NNS methods. Our approach introduces a progressive search framework consisting of iterative searching, diversification, and verification phases. Carefully designed diversification and verification steps enable our approach to efficiently approximate the optimal diverse result set according to user-specified diversification levels without additional indexing overhead. We evaluate our method on three million-scale benchmark datasets (LAION-art, Deep1M, and Txt2img), using latency, similarity, and recall as performance metrics across a range of $k$ values and diversification thresholds. Experimental results demonstrate that our approach consistently retrieves near-optimal diverse results with minimal latency overhead, particularly under medium and high diversity settings.
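
The search, diversification, and verification loop described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm. It assumes that vectors are unit-normalized, that a result set counts as diverse when every pairwise similarity is at most $\epsilon$, and that the score of a set is the sum of its query similarities. An exact scan stands in for the A$k$-NNS index, an exhaustive subset search stands in for the diversification step, and the stopping test is a simple upper bound chosen for illustration. All function and parameter names are hypothetical.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_diverse_subset(cands, vectors, k, eps):
    """Exhaustive diversification over (similarity, id) pairs: return the
    highest-scoring k-subset whose pairwise similarities are all <= eps."""
    best_score, best_ids = float("-inf"), None
    for combo in combinations(cands, k):
        ids = [i for _, i in combo]
        if all(dot(vectors[a], vectors[b]) <= eps for a, b in combinations(ids, 2)):
            score = sum(s for s, _ in combo)
            if score > best_score:
                best_score, best_ids = score, ids
    return best_score, best_ids

def progressive_diverse_search(query, vectors, k, eps, growth=2):
    # Assumption: a full scan replaces the approximate index in this sketch.
    ranked = sorted(((dot(query, v), i) for i, v in enumerate(vectors)), reverse=True)
    K = min(len(ranked), k)
    while True:
        cands = ranked[:K]                                         # search phase
        score, ids = best_diverse_subset(cands, vectors, k, eps)   # diversification phase
        if K == len(ranked):
            return ids, score                                      # nothing left to fetch
        # Verification phase (illustrative bound): a set that uses any candidate
        # outside the top K scores at most the k-1 best similarities plus the
        # best similarity not yet fetched.
        bound = sum(s for s, _ in ranked[:k - 1]) + ranked[K][0]
        if ids is not None and score >= bound:
            return ids, score
        K = min(len(ranked), K * growth)                           # fetch more candidates

vectors = [(1.0, 0.0), (0.8, 0.6), (0.6, 0.8), (0.0, 1.0), (-0.6, 0.8)]
ids, score = progressive_diverse_search((1.0, 0.0), vectors, k=2, eps=0.7)
```

On this toy input the two nearest vectors are too similar to each other for $\epsilon=0.7$, so the first round finds no valid set and the candidate pool is enlarged before a diverse pair is returned.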

Paper Structure

This paper contains 25 sections, 3 theorems, 9 equations, 16 figures, 4 tables, 4 algorithms.

Key Result

Theorem 1

If $K\geq\sum_{v\in\Phi}(\phi_v+1)+1$, then $V_K$ is sufficient to find the optimal diverse set of $V$.
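
As a worked illustration of the bound in Theorem 1, the snippet below evaluates the smallest $K$ that satisfies the inequality for a given assignment of $\phi_v$ values. The set $\Phi$ and the quantities $\phi_v$ are defined in the full paper and not in this summary, so the dictionary used here is purely hypothetical and only demonstrates the arithmetic.

```python
def sufficient_candidate_count(phi):
    """Smallest K with K >= sum over v in Phi of (phi_v + 1), plus 1.
    `phi` maps each element v of Phi to its value phi_v (hypothetical input)."""
    return sum(p + 1 for p in phi.values()) + 1

# Hypothetical values, chosen only to show the computation:
# (2 + 1) + (0 + 1) + (3 + 1) + 1 = 9
example_phi = {"a": 2, "b": 0, "c": 3}
K_min = sufficient_candidate_count(example_phi)
```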

Figures (16)

  • Figure 1: Query results from the LAION-art dataset (Schuhmann et al., 2022). $k=5$ and $q_{text}=$"a photo of a red dress" for all 4 cases. (a) The results without diversification. (b) The optimal results when $\epsilon=0.96$. (c) The optimal results when $\epsilon=0.6$. (d) The results of the greedy algorithm when $\epsilon=0.6$.
  • Figure 2:
  • Figure 3:
  • Figure 5: A demonstration of the div-A* algorithm workflow on the diversity graph in Figure \ref{fig:ge}. The gray nodes indicate the search path taken by the algorithm. The white nodes are pruned from further exploration because their maximum possible scores are lower than the best score observed so far.
  • Figure 6: An example illustrating the difference between the greedy algorithm and the div-A* algorithm.
  • ...and 11 more figures
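
The captions for Figures 5 and 6 contrast a greedy selection with a search that prunes branches whose best possible score cannot beat the best score seen so far. The sketch below illustrates that contrast on a toy instance. It is a generic branch-and-bound search, not the paper's div-A* algorithm. It assumes that the score of a set is the sum of its query similarities and that `conflict` lists the pairs of candidates too similar to appear together. All names are hypothetical.

```python
def clashes(i, chosen, conflict):
    return any((i, j) in conflict or (j, i) in conflict for j in chosen)

def greedy_diverse(sims, conflict, k):
    """Scan candidates by descending similarity; keep each one that does not
    conflict with the candidates already chosen."""
    chosen = []
    for i in sorted(range(len(sims)), key=lambda i: -sims[i]):
        if not clashes(i, chosen, conflict):
            chosen.append(i)
            if len(chosen) == k:
                break
    return chosen

def exact_diverse(sims, conflict, k):
    """Branch and bound over candidates in descending similarity order."""
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    best = {"score": float("-inf"), "set": None}

    def expand(pos, chosen, score):
        need = k - len(chosen)
        if need == 0:
            if score > best["score"]:
                best["score"], best["set"] = score, list(chosen)
            return
        if len(order) - pos < need:
            return  # not enough candidates left to complete the set
        # Optimistic bound: fill the open slots with the next-best candidates,
        # ignoring conflicts. Prune if even that cannot beat the best so far.
        bound = score + sum(sims[order[p]] for p in range(pos, pos + need))
        if bound <= best["score"]:
            return
        i = order[pos]
        if not clashes(i, chosen, conflict):
            chosen.append(i)
            expand(pos + 1, chosen, score + sims[i])  # branch: include i
            chosen.pop()
        expand(pos + 1, chosen, score)                # branch: skip i

    expand(0, [], 0.0)
    return best["set"], best["score"]

sims = [0.9, 0.8, 0.7, 0.1]
conflict = {(0, 1), (0, 2)}   # candidate 0 is too similar to candidates 1 and 2
greedy_set = greedy_diverse(sims, conflict, k=2)
exact_set, exact_score = exact_diverse(sims, conflict, k=2)
```

On this instance the greedy pass commits to candidate 0 and is left with candidate 3, for a total of 1.0, while the exact search returns candidates 1 and 2 with a total of 1.5.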

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof