CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs

Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, Yong Wang

TL;DR

CAGRA addresses the need for fast, scalable ANNS on GPUs by designing a fixed-outdegree proximity graph and a GPU-oriented construction and search pipeline. It combines rank-based graph optimization with reverse-edge addition to boost two-hop reachability while avoiding expensive distance calculations, and it implements a GPU-tailored search using warp-split teams, memory-efficient hash tables, and a flexible single- or multi-CTA strategy. Empirical results show substantial speedups over CPU and GPU state-of-the-art methods in graph construction and in both large-batch and single-query search, with competitive recall across multiple datasets and scales. The approach enables high-throughput, low-latency nearest neighbor search for modern data-intensive applications and is available in NVIDIA RAPIDS RAFT.

Abstract

Approximate Nearest Neighbor Search (ANNS) plays a critical role in various disciplines spanning data mining and artificial intelligence, from information retrieval and computer vision to natural language processing and recommender systems. Data volumes have soared in recent years, and the computational cost of an exhaustive exact nearest neighbor search is often prohibitive, necessitating the adoption of approximate techniques. Graph-based approaches have recently garnered significant attention among ANNS algorithms for their balance of performance and recall. However, only a few studies have explored harnessing the power of GPUs and multi-core processors, despite the widespread use of massively parallel and general-purpose computing. To bridge this gap, we introduce a novel proximity graph and search algorithm designed for parallel computing hardware. By leveraging the high-performance capabilities of modern hardware, our approach achieves remarkable efficiency gains. In particular, our method surpasses existing CPU- and GPU-based methods in constructing the proximity graph and demonstrates higher throughput in both large- and small-batch searches while maintaining comparable accuracy. In graph construction time, our method, CAGRA, is 2.2x to 27x faster than HNSW, which is one of the CPU SOTA implementations. In large-batch query throughput in the 90% to 95% recall range, our method is 33x to 77x faster than HNSW and 3.8x to 8.8x faster than the SOTA implementations for GPU. For a single query, our method is 3.4x to 53x faster than HNSW at 95% recall.

Paper Structure

This paper contains 41 sections, 3 equations, 16 figures, and 2 tables.

Figures (16)

  • Figure 1: Construction flow of the CAGRA graph.
  • Figure 2: CAGRA edge reordering and pruning flow. We assume pruning edges from the node $X$. Left: The initial rank of the edges from $X$ and other related edges. Middle: Possible two-hop routes, classified as detourable or not detourable by Eq. \ref{eq:detourable-condition}. We use the rank instead of the distance. Right: The number of detourable routes of each node connected to $X$. The edges are discarded from the end of the list ordered by the number of detourable routes. In this example, the nodes $A$, $B$, and $E$ are preserved as the neighbors of node $X$, although node $E$ is the farthest from $X$ by distance among its initial neighbors.
  • Figure 3: Comparison of the 2-hop node counts and strongly connected components (strong CC) among a $k$-NN graph and the graphs partially and fully optimized by CAGRA from an initial $k$-NN graph. The number in brackets in each label is the degree of the graph ($d$), and we set the degree of the initial graph to $d_\text{init}=3d$.
  • Figure 4: CAGRA graph optimization time comparison with rank- and distance-based reordering.
  • Figure 5: CAGRA search performance comparison between the graphs optimized by rank- and distance-based reordering. CAGRA performs rank-based reordering while CAGRA (distance-based) performs distance-based reordering.
  • ...and 11 more figures
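
The rank-based reordering and pruning summarized in the Figure 2 caption can be illustrated with a small CPU-side sketch. This is not the CAGRA implementation: it assumes the detourable condition has the form max(rank(X→Z), rank(Z→Y)) < rank(X→Y), since the equation itself is not reproduced on this page, and it assumes ties in the detourable-route count are broken by initial rank. The names `count_detourable_routes`, `prune_node`, and `toy` are hypothetical, and the toy graph is made up to mirror the caption's outcome rather than copied from the figure.

```python
from typing import Dict, List

# node -> neighbor list sorted by ascending distance, so list index = rank
Graph = Dict[int, List[int]]


def count_detourable_routes(graph: Graph, x: int) -> List[int]:
    """For each edge x -> y, count two-hop routes x -> z -> y that are
    detourable under the ASSUMED condition
    max(rank(x -> z), rank(z -> y)) < rank(x -> y)."""
    neighbors = graph[x]
    counts = [0] * len(neighbors)
    for rank_xy, y in enumerate(neighbors):
        # Only neighbors ranked before y can satisfy rank(x -> z) < rank(x -> y).
        for rank_xz, z in enumerate(neighbors[:rank_xy]):
            z_neighbors = graph.get(z, [])
            if y in z_neighbors:
                rank_zy = z_neighbors.index(y)
                if max(rank_xz, rank_zy) < rank_xy:
                    counts[rank_xy] += 1
    return counts


def prune_node(graph: Graph, x: int, d: int) -> List[int]:
    """Keep the d edges of x with the fewest detourable routes.
    Ties keep their initial rank order (an assumption of this sketch)."""
    counts = count_detourable_routes(graph, x)
    order = sorted(range(len(counts)), key=lambda i: counts[i])  # stable sort
    return [graph[x][i] for i in order[:d]]


# Hypothetical toy graph: X=0 with initial neighbors A=1, B=2, C=3, D=4, E=5.
toy: Graph = {
    0: [1, 2, 3, 4, 5],
    1: [3, 2],
    2: [4, 1],
    3: [4],
    4: [3],
    5: [4],
}

print(count_detourable_routes(toy, 0))  # [0, 0, 1, 2, 0]
print(prune_node(toy, 0, d=3))          # [1, 2, 5]
```

In this toy graph the edges to C and D have detourable routes through closer neighbors, so they are discarded first, while E is kept despite being the farthest neighbor. No distances are computed during pruning; only positions in the neighbor lists are compared, which is the property the TL;DR attributes to rank-based optimization.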