GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search

Jifan Shi, Jianyang Gao, James Xia, Tamás Béla Fehér, Cheng Long

TL;DR

This paper presents IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline, and develops a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale.

Abstract

Approximate nearest neighbor search (ANNS) on GPUs is increasingly popular for modern retrieval and recommendation workloads that operate over massive collections of high-dimensional vectors. Graph-based indexes deliver high recall and throughput but incur heavy build-time and storage costs. In contrast, cluster-based methods build and scale efficiently yet often need many probes to reach high recall, straining memory bandwidth and compute. To simultaneously achieve fast index build, high-throughput search, high recall, and low storage requirements on GPUs, we present IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. Specifically, for index build, we develop a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale. For search, we develop GPU-native distance computation schemes for RaBitQ codes and a fused search kernel to achieve high throughput with high recall. With IVF-RaBitQ implemented and integrated into the NVIDIA cuVS library, experiments on cuVS Bench across multiple datasets show that IVF-RaBitQ offers a strong performance frontier in recall, throughput, index build time, and storage footprint. At a recall of approximately 0.95, IVF-RaBitQ achieves 2.2x higher QPS than the state-of-the-art graph-based method CAGRA while also constructing indices 7.7x faster on average. Compared to the cluster-based method IVF-PQ, IVF-RaBitQ delivers on average over 2.7x higher throughput without accessing the raw vectors for reranking.
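
To make the remark about probes concrete, the following is a minimal CPU sketch of a generic IVF index in plain Python. It is not the paper's GPU pipeline and it is not the cuVS API: the function names (`sq_dist`, `build_ivf`, `ivf_search`) are hypothetical, the centroids are given by hand rather than learned with k-means, and candidates are ranked with exact distances instead of RaBitQ codes. It only illustrates why a cluster-based index may need more than one probe to recover the true nearest neighbors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe clusters closest to the query and return the top-k ids."""
    probed = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probed for vid in lists[c]]
    candidates.sort(key=lambda vid: sq_dist(query, vectors[vid]))
    return candidates[:k]


# Toy data (assumed for illustration): two clusters on a line.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = [(1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (9.0, 0.0)]
lists = build_ivf(vectors, centroids)

query = (4.9, 0.0)
top2_one_probe = ivf_search(query, vectors, centroids, lists, nprobe=1, k=2)
top2_two_probes = ivf_search(query, vectors, centroids, lists, nprobe=2, k=2)
```

In this toy setup the query sits near the boundary between the two clusters, so probing a single cluster misses vector 2, which is one of the two true nearest neighbors; probing both clusters recovers it at the cost of scanning more data.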

Paper Structure (32 sections, 10 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Bitwise Inner Product Computation. $\hat{\mathbf{x}}_\mathrm{b}[i]$ denotes the $i$th dimension (bit) of the data vector $\hat{\mathbf{x}}_\mathrm{b}$, and $\hat{\mathbf{q}}^{(j)}[i]$ denotes the $j$th bit of the query vector's $i$th dimension $\hat{\mathbf{q}}[i]$.
  • Figure 2: Data layout for Inverted Lists in IVF-RaBitQ. nk is the number of clusters.
  • Figure 3: Time-accuracy trade-off of ANN search on representative datasets (log-scale). Bitwise and LUT denote the two GPU inner-product methods used by IVF-RaBitQ, while w/ refine and w/o refine indicate whether distance refinement is enabled for IVF-PQ.
  • Figure 4: Index build time across different datasets. Each subplot compares build time across methods.
  • Figure 5: Storage requirement of different indexing methods. CAGRA and IVF-PQ (w/ref) require RAW vectors for refinement, thus RAW storage is included.
  • ...and 2 more figures
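
The notation in the Figure 1 caption suggests a bit-plane decomposition, which can be sketched as follows. This is a CPU illustration in plain Python under stated assumptions, not the paper's GPU kernel: it assumes the data vector holds one bit per dimension, that each query dimension is an unsigned integer of `num_bits` bits, and that the inner product is evaluated as $\sum_j 2^j \cdot \mathrm{popcount}(\hat{\mathbf{x}}_\mathrm{b} \wedge \hat{\mathbf{q}}^{(j)})$. The helper names are hypothetical, and any rescaling factors RaBitQ applies to turn this integer into a distance estimate are omitted.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one integer; bit i of the result is bits[i]."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def bit_planes(query, num_bits):
    """Split unsigned integers into bit planes; plane j packs bit j of every dimension."""
    return [pack_bits([(q >> j) & 1 for q in query]) for j in range(num_bits)]


def bitwise_inner_product(x_packed, planes):
    """Inner product of a 1-bit data vector with a multi-bit query via AND + popcount."""
    total = 0
    for j, plane in enumerate(planes):
        # popcount(x AND plane_j) counts dimensions where x[i] = 1 and bit j of q[i] = 1.
        total += (1 << j) * bin(x_packed & plane).count("1")
    return total


# Toy example (values assumed for illustration): 8 dimensions, 4-bit query.
x_b = [1, 0, 1, 1, 0, 1, 0, 0]
q = [3, 15, 7, 0, 9, 12, 1, 5]

naive = sum(x * v for x, v in zip(x_b, q))
fast = bitwise_inner_product(pack_bits(x_b), bit_planes(q, 4))
```

Both `naive` and `fast` evaluate to 22 here. The bitwise form replaces per-dimension multiplications with one AND and one popcount per bit plane, which is the kind of operation that maps naturally onto wide machine words.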