PiPNN: Ultra-Scalable Graph-Based Nearest Neighbor Indexing

Tobias Rubel, Richard Wen, Laxman Dhulipala, Lars Gottesbüren, Rajesh Jayaram, Jakub Łącki

TL;DR

PiPNN (Pick-in-Partitions Nearest Neighbors) is an ultra-scalable graph construction algorithm that avoids the "search bottleneck" of existing graph-based methods and makes it possible, for the first time, to build high-quality ANN indexes on billion-scale datasets in under 20 minutes on a single multicore machine.

Abstract

The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but take a long time to construct because they rely on random-access-heavy beam searches. We introduce PiPNN (Pick-in-Partitions Nearest Neighbors), an ultra-scalable graph construction algorithm that avoids the "search bottleneck" that existing graph-based methods suffer from. PiPNN's core innovation is HashPrune, a novel online pruning algorithm that dynamically maintains sparse collections of edges. HashPrune enables PiPNN to partition the dataset into overlapping sub-problems, efficiently perform bulk distance comparisons via dense matrix multiplication kernels, and stream a subset of the edges into HashPrune. HashPrune guarantees bounded memory during index construction, which permits PiPNN to build higher-quality indexes without the use of extra intermediate memory. PiPNN builds state-of-the-art indexes up to 11.6x faster than Vamana (DiskANN) and up to 12.9x faster than HNSW. PiPNN is also significantly more scalable than recent algorithms for fast graph construction: it builds indexes at least 19.1x faster than MIRAGE and 17.3x faster than FastKCNA while producing indexes that achieve higher query throughput. PiPNN enables us to build, for the first time, high-quality ANN indexes on billion-scale datasets in under 20 minutes using a single multicore machine.
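
The bulk distance step mentioned in the abstract can be illustrated with a standard identity. This is only a sketch: it assumes squared Euclidean distance (the abstract does not name the metric) and a hypothetical matrix $X \in \mathbb{R}^{n \times d}$ whose rows $x_i$ are the points of one sub-problem. Under those assumptions, all pairwise distances follow from a single dense product $X X^\top$:

$$
D_{ij} = \|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2\, x_i^\top x_j,
\qquad
D = s\,\mathbf{1}^\top + \mathbf{1}\, s^\top - 2\, X X^\top,
\quad s_i = \|x_i\|^2 .
$$

The $X X^\top$ term dominates the cost and maps directly onto a matrix multiplication kernel, in contrast to computing $n^2$ distances one pair at a time.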

Paper Structure

This paper contains 30 sections, 3 theorems, 1 equation, 15 figures, 5 tables, 5 algorithms.

Key Result

Theorem 1

HashPrune is history-independent: Given a hash function $h_p$ for some point $p$, a collection of candidates $C$, and a reservoir size $\ell$, the final adjacency list produced by HashPrune is unique and independent of the insertion order of candidates in $C$.
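
The property in Theorem 1 can be checked empirically on a toy version of the algorithm. The sketch below is not the paper's implementation. It assumes that $h_p$ takes $m$ sign bits of $c - p$ against random hyperplanes, that a collision keeps the closer candidate, and that a full reservoir evicts its farthest entry when a closer candidate arrives in a new bucket. The names `make_hash`, `hash_prune`, and `sq_dist` are hypothetical. Ties are broken by comparing `(distance, candidate)` tuples so that the ordering is total.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_hash(p, m, seed=0):
    """Hypothetical h_p: m sign bits of (c - p) against random hyperplanes."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in p] for _ in range(m)]

    def h(c):
        diff = [ci - pi for ci, pi in zip(c, p)]
        bits = 0
        for plane in planes:
            dot = sum(w * x for w, x in zip(plane, diff))
            bits = (bits << 1) | int(dot >= 0.0)
        return bits

    return h


def hash_prune(p, candidates, h, ell):
    """Stream candidates into a reservoir holding at most ell hash buckets."""
    reservoir = {}  # bucket id -> (squared distance, candidate)
    for c in candidates:
        key = (sq_dist(p, c), c)
        b = h(c)
        if b in reservoir:
            # Collision: keep only the closer candidate in this bucket.
            if key < reservoir[b]:
                reservoir[b] = key
        elif len(reservoir) < ell:
            reservoir[b] = key
        else:
            # Reservoir full: replace the farthest entry if c is closer.
            worst = max(reservoir, key=reservoir.get)
            if key < reservoir[worst]:
                del reservoir[worst]
                reservoir[b] = key
    return sorted(c for _, c in reservoir.values())


# Same candidates, two insertion orders.
rng = random.Random(1)
p = (0.0, 0.0, 0.0)
C = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(200)]
h_p = make_hash(p, m=4)
first = hash_prune(p, C, h_p, ell=8)
rng.shuffle(C)
second = hash_prune(p, C, h_p, ell=8)
print(first == second)
```

Under these assumptions the reservoir always holds the $\ell$ closest per-bucket minima seen so far, which is why the result does not depend on the order in which `C` is streamed.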

Figures (15)

  • Figure 1: Build time speedup compared to HNSW on six benchmarks, including billion-scale inputs from big-ann-benchmarks.
  • Figure 2: Impact of resolution on neighbor retention. Shading represents the probability of collision based on number of bits used in the HashPrune hash ($m$). (a) At coarse resolution, $c'$ collides with and evicts the farther neighbor $c_2$. (b) At finer resolution, both edges are retained.
  • Figure 3: Multi-level fanout significantly reduces build times, with the effect becoming more pronounced as fanout is increased.
  • Figure 4: Portion of time spent in Partitioning, Leaf Building, and Final Prune for the four ablation datasets.
  • Figure 5: QPS vs recall for all algorithms on four billion-size datasets (BigANN-1B, MS-SPACEV-1B, MS-TURING-1B, and DEEP-1B) and two high-dimensional datasets (OpenAIArXiv-2M and WikipediaCohere-35M). Build times are listed in the legends in seconds.
  • ...and 10 more figures

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Lemma 1
  • Lemma 2