Efficiently Constructing Sparse Navigable Graphs
Alex Conway, Laxman Dhulipala, Martin Farach-Colton, Rob Johnson, Ben Landrum, Christopher Musco, Yarin Shechter, Torsten Suel, Richard Wen
TL;DR
This work tackles the fundamental problem of efficiently constructing sparse navigable graphs for graph-based nearest neighbor search. By recasting sparsest navigable graph construction as $n$ correlated set cover instances and leveraging a distance-based permutation matrix, the authors develop a randomized algorithm that runs in near-quadratic time, $\tilde{O}(n^2)$, while guaranteeing an $O(\log n)$-approximation of the sparsest navigable graph under general distance functions. The approach combines sublinear-time set cover methods, a simulation of the greedy algorithm for edge selection, and preprocessing steps with random edges and cliques to bound the number of uncovered elements. A runtime lower bound under SETH and the NP-hardness of beating the $O(\log n)$ approximation factor show that these guarantees are near-optimal. Extensions to $\alpha$-shortcut reachable and $\tau$-monotonic graphs give algorithms that beat cubic time, with runtimes of $\tilde{O}(n^{2.5})$ or better, and concurrent work confirms near-optimal time complexity via a related set cover perspective. The paper thus advances both the theory and practice of scalable, provably good navigable graph construction for large-scale nearest neighbor tasks.
Abstract
Graph-based nearest neighbor search methods have seen a surge of popularity in recent years, offering state-of-the-art performance across a wide variety of applications. Central to these methods is the task of constructing a sparse navigable search graph for a given dataset endowed with a distance function. Unfortunately, doing so is computationally expensive, so heuristics are universally used in practice. In this work, we initiate the study of fast algorithms with provable guarantees for search graph construction. For a dataset with $n$ data points, the problem of constructing an optimally sparse navigable graph can be framed as $n$ separate but highly correlated minimum set cover instances. This yields a naive $O(n^3)$ time greedy algorithm that returns a navigable graph whose number of edges is at most an $O(\log n)$ factor higher than optimal. We improve significantly on this baseline, taking advantage of correlation between the set cover instances to leverage techniques from streaming and sublinear-time set cover algorithms. By also introducing problem-specific pre-processing techniques, we obtain an $\tilde{O}(n^2)$ time algorithm for constructing an $O(\log n)$-approximate sparsest navigable graph under any distance function. The runtime of our method is optimal up to logarithmic factors under the Strong Exponential Time Hypothesis via a reduction from Monochromatic Closest Pair. Moreover, we prove that, as with general set cover, obtaining better than an $O(\log n)$-approximation is NP-hard, despite the significant additional structure present in the navigable graph problem. Finally, we show that our approach can also beat cubic time for the closely related and practically important problems of constructing $\alpha$-shortcut reachable and $\tau$-monotonic graphs, which are also used for nearest neighbor search. For such graphs, we obtain $\tilde{O}(n^{2.5})$ time or better algorithms.
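To make the set cover framing concrete, the following is a minimal Python sketch of the naive greedy baseline, not the paper's $\tilde{O}(n^2)$ algorithm. It assumes the usual definition of navigability, which the abstract does not spell out: for every ordered pair $s \neq t$, node $s$ has an out-neighbor $u$ with $d(u, t) < d(s, t)$. Under that assumption, each source $s$ defines one set cover instance in which the elements are the targets $t \neq s$ and candidate neighbor $u$ covers $t$ when $d(u, t) < d(s, t)$. The function names `greedy_navigable_graph` and `is_navigable` are illustrative, and the sketch assumes all points are distinct. For readability it recomputes coverage counts in every round, so it does not attempt to match the $O(n^3)$ bound, which needs the standard bookkeeping for greedy set cover.

```python
from typing import Callable, List, Sequence, Set, TypeVar

P = TypeVar("P")


def greedy_navigable_graph(
    points: Sequence[P], dist: Callable[[P, P], float]
) -> List[Set[int]]:
    """Return out-neighbor sets chosen by greedy set cover, one instance per source.

    Assumption: u covers target t for source s when dist(u, t) < dist(s, t).
    """
    n = len(points)
    # Precompute all pairwise distances (n^2 distance evaluations).
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    graph: List[Set[int]] = []
    for s in range(n):
        uncovered = {t for t in range(n) if t != s}
        neighbors: Set[int] = set()
        while uncovered:
            best_u, best_gain = -1, 0
            for u in range(n):
                if u == s or u in neighbors:
                    continue
                # Number of still-uncovered targets that u brings s closer to.
                gain = sum(1 for t in uncovered if d[u][t] < d[s][t])
                if gain > best_gain:
                    best_u, best_gain = u, gain
            if best_gain == 0:
                # Only possible when some target is at distance 0 from s.
                raise ValueError("points must be distinct")
            neighbors.add(best_u)
            uncovered = {t for t in uncovered if not d[best_u][t] < d[s][t]}
        graph.append(neighbors)
    return graph


def is_navigable(
    points: Sequence[P], dist: Callable[[P, P], float], graph: List[Set[int]]
) -> bool:
    """Check the assumed navigability condition for every ordered pair (s, t)."""
    n = len(points)
    return all(
        any(dist(points[u], points[t]) < dist(points[s], points[t]) for u in graph[s])
        for s in range(n)
        for t in range(n)
        if s != t
    )
```

For example, calling `greedy_navigable_graph([0.0, 1.0, 2.0, 3.0], lambda a, b: abs(a - b))` on four points on a line selects only edges between adjacent points. Because each target $t$ is always covered by the candidate $u = t$ when points are distinct, every per-source instance is feasible and the greedy loop terminates.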
