BBC: Improving Large-k Approximate Nearest Neighbor Search with a Bucket-based Result Collector

Ziqi Yin, Gao Cong, Kai Zeng, Jinwei Zhu, Bin Cui

Abstract

Although Approximate Nearest Neighbor (ANN) search has been extensively studied, large-k ANN queries, which aim to retrieve a large number of nearest neighbors, remain underexplored despite their numerous real-world applications. Existing ANN methods suffer significant performance degradation on such queries. In this work, we first investigate why quantization-based ANN indexes degrade: (1) the inefficiency of existing top-k collectors, which incur significant overhead in candidate maintenance, and (2) the reduced pruning effectiveness of quantization methods, which leads to a costly re-ranking process. To address this, we propose a novel bucket-based result collector (BBC) that improves the efficiency of existing quantization-based ANN indexes for large-k ANN queries. BBC introduces two key components: (1) a bucket-based result buffer that organizes candidates into buckets by their distances to the query, reducing ranking costs and improving cache efficiency, and thereby enabling high-performance maintenance of a candidate superset and a lightweight final selection of the top-k results; and (2) two re-ranking algorithms tailored to different types of quantization methods, which accelerate re-ranking by reducing either the number of candidate objects to be re-ranked or the number of cache misses. Extensive experiments on real-world datasets demonstrate that BBC accelerates existing quantization-based ANN methods by up to 3.8x at recall@k = 0.95 for large-k ANN queries.
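The first component described above, a result buffer that bins candidates by distance rather than maintaining a heap, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the class name `BucketCollector`, the fixed-width binning over a known distance range `[d_min, d_max]`, and the shrinking `threshold` are all choices made for the sketch. The key idea it demonstrates is that insertion is a cheap append into a bucket, pruning drops whole tail buckets once the leading buckets already hold at least k candidates, and the only sort happens once at the end over a small candidate superset.

```python
class BucketCollector:
    """Minimal sketch of a bucket-based top-k result collector.

    Candidates are binned by (estimated) distance instead of being
    kept in a heap; only the final selection sorts the surviving
    superset. Illustrative only, not the paper's API.
    """

    def __init__(self, k, d_min, d_max, num_buckets=64):
        self.k = k
        self.d_min = d_min
        self.width = (d_max - d_min) / num_buckets
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0                   # candidates in buckets[:threshold]
        self.threshold = num_buckets    # buckets >= threshold are pruned

    def _bucket_of(self, dist):
        i = int((dist - self.d_min) / self.width)
        return min(max(i, 0), self.num_buckets - 1)

    def insert(self, obj_id, dist):
        i = self._bucket_of(dist)
        if i >= self.threshold:
            return  # cannot belong to the candidate superset
        self.buckets[i].append((dist, obj_id))
        self.size += 1
        # Drop the last admissible bucket whenever the buckets before it
        # already hold at least k candidates; the superset property is
        # preserved because we never go below k retained candidates.
        while self.threshold > 1 and \
                self.size - len(self.buckets[self.threshold - 1]) >= self.k:
            self.threshold -= 1
            self.size -= len(self.buckets[self.threshold])
            self.buckets[self.threshold].clear()

    def topk(self):
        # Lightweight final selection: one sort over the small superset.
        cands = [p for b in self.buckets[:self.threshold] for p in b]
        cands.sort()
        return cands[: self.k]
```

In a heap-based collector every insertion costs O(log k) comparisons against scattered heap nodes; here most insertions are a single bounds check plus an append, which is the cache-friendly behavior the abstract attributes to the bucket design.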

Paper Structure

This paper contains 16 sections, 3 theorems, 13 equations, 11 figures, 6 tables, 4 algorithms.

Key Result

Theorem 1

Let $q,o \in \mathbb{R}^d$ be independently and uniformly sampled from the unit sphere. Let $R=\|q-o\|_2\in[0,2]$ denote the Euclidean distance between them, with $F(r)=\mathbb{P}(R\le r)$ being its cumulative distribution function (CDF). For an integer $m\ge 2$, consider the equal-depth partition determined by the boundaries $b_i = F^{-1}(i/m)$ for $i=0,1,\dots,m$. Let $\widehat{R}$ be the quantized distance obtained by mapping $R \in [b_{i},b_{i+1}]$ to its upper boundary $b_{i+1}$.
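The equal-depth setup in the theorem can be illustrated with a small empirical sketch: estimate the quantile boundaries $b_i$ from sampled distances, then map each distance to the upper boundary of its interval. The function name `equal_depth_quantizer` and the use of empirical quantiles in place of the exact $F^{-1}$ are assumptions of this sketch, not the paper's code.

```python
import bisect

def equal_depth_quantizer(samples, m):
    """Build an equal-depth (quantile) distance quantizer.

    Boundaries b_0 <= ... <= b_m are empirical (i/m)-quantiles, so each
    interval [b_i, b_{i+1}] holds an equal share of the sampled mass.
    A distance R in [b_i, b_{i+1}] is mapped to the upper boundary
    b_{i+1}, matching the theorem's setup. Illustrative sketch only.
    """
    xs = sorted(samples)
    n = len(xs)
    b = [xs[0]] + [xs[min(n - 1, (i * n) // m)] for i in range(1, m)] + [xs[-1]]

    def quantize(r):
        # Find i with b[i-1] < r <= b[i]; clamp so out-of-range values
        # fall into the first or last interval.
        i = min(max(bisect.bisect_left(b, r), 1), m)
        return b[i]

    return b, quantize
```

Because every interval carries equal probability mass, equal-depth partitioning keeps the expected rounding error balanced across buckets even when the distance distribution is sharply concentrated, which is the case Figure 4's density plot illustrates.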

Figures (11)

  • Figure 1: Querying Performance of IVF, HNSW, IVF+PQ, and IVF+RaBitQ on the C4 dataset at $k$ = 100 and $k$ = 5000.
  • Figure 2: Time Overhead Breakdown of four methods at different $k$, where "Distance computation" denotes exact distance computation, "FastScan" denotes estimated distance computation, "Heap" denotes heap operations, and "Other" covers the remaining costs.
  • Figure 3: Illustration of the Proposed Bucket-based Result Collector.
  • Figure 4: The probability density function (PDF) of distances between the query and data vectors on the C4 dataset.
  • Figure 5: The accuracy-efficiency trade-off results under different $k$ (upper and right is better).
  • ...and 6 more figures

Theorems & Definitions (3)

  • Theorem 1: Expected Mean Absolute Error
  • Lemma 1
  • Theorem 2: Expected Mean Absolute Error