Efficiently Constructing Sparse Navigable Graphs

Alex Conway, Laxman Dhulipala, Martin Farach-Colton, Rob Johnson, Ben Landrum, Christopher Musco, Yarin Shechter, Torsten Suel, Richard Wen

TL;DR

This work tackles the fundamental problem of efficiently constructing sparse navigable graphs for graph-based nearest neighbor search. By recasting sparsest navigable graph construction as $n$ correlated set cover instances and leveraging a distance-based permutation matrix, the authors develop a randomized algorithm that runs in near-quadratic time, $\tilde{O}(n^2)$, while guaranteeing an $O(\log n)$-approximation of the sparsest graph under general distance functions. The approach combines sublinear-time set cover methods, a simulation of the greedy algorithm for edge selection, and preprocessing steps that use random edges and cliques to bound the number of uncovered elements. Hardness results show these guarantees are near-optimal: the runtime is optimal up to logarithmic factors under SETH, and improving the approximation factor is NP-hard. Extensions to $\alpha$-shortcut reachable and $\tau$-monotonic graphs achieve similar guarantees with runtimes of $\tilde{O}(n^{2.5})$ or better, and concurrent work confirms near-optimal time complexity via a related set cover perspective. The paper thus advances both the theory and practice of scalable, provably good navigable graph construction for high-volume nearest neighbor tasks.

Abstract

Graph-based nearest neighbor search methods have seen a surge of popularity in recent years, offering state-of-the-art performance across a wide variety of applications. Central to these methods is the task of constructing a sparse navigable search graph for a given dataset endowed with a distance function. Unfortunately, doing so is computationally expensive, so heuristics are universally used in practice. In this work, we initiate the study of fast algorithms with provable guarantees for search graph construction. For a dataset with $n$ data points, the problem of constructing an optimally sparse navigable graph can be framed as $n$ separate but highly correlated minimum set cover instances. This yields a naive $O(n^3)$ time greedy algorithm that returns a navigable graph whose sparsity is at most an $O(\log n)$ factor higher than optimal. We improve significantly on this baseline, taking advantage of correlation between the set cover instances to leverage techniques from streaming and sublinear-time set cover algorithms. By also introducing problem-specific pre-processing techniques, we obtain an $\tilde{O}(n^2)$ time algorithm for constructing an $O(\log n)$-approximate sparsest navigable graph under any distance function. The runtime of our method is optimal up to logarithmic factors under the Strong Exponential Time Hypothesis via a reduction from Monochromatic Closest Pair. Moreover, we prove that, as with general set cover, obtaining better than an $O(\log n)$-approximation is NP-hard, despite the significant additional structure present in the navigable graph problem. Finally, we show that our approach can also beat cubic time for the closely related and practically important problems of constructing $\alpha$-shortcut reachable and $\tau$-monotonic graphs, which are also used for nearest neighbor search. For such graphs, we obtain $\tilde{O}(n^{2.5})$ time or better algorithms.
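
The set cover framing in the abstract can be illustrated with a minimal Python sketch of the naive greedy baseline. It assumes Euclidean points given as tuples, pairwise distinct points, and the covering sets $S_{i\rightarrow j} = \{p_k : d(p_j,p_k) < d(p_i,p_k)\}$ from the Figure 1 caption. The function names `greedy_navigable_graph` and `is_navigable` are hypothetical, and this simple version makes no attempt to match the $O(n^3)$ bound or to reproduce the paper's faster algorithm.

```python
import math


def greedy_navigable_graph(points, dist=math.dist):
    """Out-neighbor lists from one greedy set cover per node (illustrative only)."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    out = []
    for i in range(n):
        # cover[j] is S_{i->j}: every k that an edge i -> j moves strictly closer to.
        cover = {
            j: {k for k in range(n) if k != i and d[j][k] < d[i][k]}
            for j in range(n) if j != i
        }
        uncovered = set(range(n)) - {i}
        chosen = []
        while uncovered:
            # Greedy rule: pick the out-neighbor covering the most uncovered targets.
            best = max(cover, key=lambda j: len(cover[j] & uncovered))
            if not cover[best] & uncovered:
                raise ValueError("no progress possible: are there duplicate points?")
            chosen.append(best)
            uncovered -= cover[best]
        out.append(chosen)
    return out


def is_navigable(points, out, dist=math.dist):
    """Check that every i has an out-neighbor strictly closer to every target k != i."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    return all(
        any(d[j][k] < d[i][k] for j in out[i])
        for i in range(n) for k in range(n) if i != k
    )


# Three collinear points: the greedy covers reproduce the path graph.
graph = greedy_navigable_graph([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)])
print(graph)  # [[1], [0, 2], [1]]
```

Each target $p_k$ is always covered by the direct edge to $p_k$ itself, so the loop terminates whenever the points are distinct. The standard greedy set cover guarantee then gives each node an out-degree within a logarithmic factor of its own optimum.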

Paper Structure

This paper contains 44 sections, 34 theorems, 41 equations, 6 figures, 6 algorithms.

Key Result

Theorem 1

There is a randomized algorithm running in $\tilde{O}(n^2)$ time ...

Footnote: We assume $d(\cdot,\cdot)$ can be evaluated in $O(1)$ time. Formally, if $d(\cdot,\cdot)$ takes $T$ time to evaluate, our method runs in $\tilde{O}(n^2) + O(n^2\cdot T)$ time. Throughout, we use $\tilde{O}(m)$ to denote $O(m\log^c m)$ f...

Figures (6)

  • Figure 1: Illustration of the navigability set cover problem corresponding to node $p_1$. Each image shows the set $S_{1\rightarrow j}$ for a different choice of out-neighbor $j$. $S_{1\rightarrow j}$ contains all $p_k$ for which $d(p_j,p_k) < d(p_1,p_k)$. Constructing a navigable graph that has the fewest out-edges from node $1$ is equivalent to solving the minimum set cover problem for this instance. Constructing a navigable graph with the fewest total edges is equivalent to solving the minimum set cover problem for all $n$ different problem instances.
  • Figure 2: An example distance-based permutation matrix, $\Pi$, for a data set in two-dimensional Euclidean space. The $i^\text{th}$ row contains all points in $P$ sorted in increasing order of their distance from $p_i$. $\Pi$'s first row is $[p_1, p_5, p_4, p_6, p_2, p_3]$ since $d(p_1,p_1) < d(p_1,p_5) <\ldots < d(p_1,p_3)$. $\Pi$ can be used to quickly identify what sets a node $p_k$ is contained in for a particular set cover instance $\mathcal{I}_i$: if $p_k \in S_{i\rightarrow j}$, then $p_j$ will lie to the left of $p_i$ in the $k^\text{th}$ row of $\Pi$. This property is illustrated for a particular set $S_{1\rightarrow 3} = \{p_2, p_3, p_4\}$, highlighted in red. As we can see, $p_3$ lies to the left of $p_1$ in rows $2,3,4$ in $\Pi$, and to the right in all other rows.
  • Figure 3: A partial illustration of the graph construction used to prove Theorem \ref{thm:opt-upper-bound}. We begin by partitioning our points into $O(\sqrt{n})$ arbitrary groups of size $O(\sqrt{n})$. An example group, $K$, is illustrated in green. We connect all points in $K$ via a clique and for every point $p_k \notin K$, we draw an edge from $p_k$'s nearest neighbor in $K$ to $p_k$. By Lemma \ref{lem:cliques}, this yields a navigable graph. The total number of edges is $O(n^{3/2})$.
  • Figure 4: Specification of distance metric $d$ on different types of points in $P$.
  • Figure 5: Euclidean distance between different types of points in $\varphi'(U,\mathcal{F}).$
  • ...and 1 more figure
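
The captions for Figures 2 and 3 describe two concrete constructions that can be sketched in a few lines of Python. The sketch below assumes Euclidean points with distinct pairwise distances from each point, so that each row of $\Pi$ is unambiguous. All function names are hypothetical. As a deviation from the caption, `clique_graph` adds an edge from every member of a group that is tied for nearest to $p_k$, which keeps the result navigable even when ties occur.

```python
import math


def distance_matrix(points):
    """Symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            d[a][b] = d[b][a] = math.dist(points[a], points[b])
    return d


def permutation_matrix(d):
    """Row i of Pi: all point indices sorted by increasing distance from p_i."""
    n = len(d)
    return [sorted(range(n), key=lambda k, i=i: d[i][k]) for i in range(n)]


def ranks(pi):
    """rank[k][x] is the column of p_x in row k of Pi."""
    rank = [[0] * len(pi) for _ in pi]
    for k, row in enumerate(pi):
        for col, x in enumerate(row):
            rank[k][x] = col
    return rank


def in_cover_set(rank, i, j, k):
    """p_k is in S_{i->j} exactly when p_j lies left of p_i in row k (no ties assumed)."""
    return rank[k][j] < rank[k][i]


def clique_graph(d):
    """Figure 3 style graph: cliques on ~sqrt(n)-sized groups plus nearest-member edges."""
    n = len(d)
    size = max(1, math.ceil(math.sqrt(n)))
    groups = [list(range(s, min(s + size, n))) for s in range(0, n, size)]
    out = [set() for _ in range(n)]
    for group in groups:
        members = set(group)
        for a in group:
            out[a] |= members - {a}  # clique inside the group
        for k in range(n):
            if k in members:
                continue
            nearest = min(d[m][k] for m in group)
            for m in group:
                if d[m][k] == nearest:  # ties included (assumption, see lead-in)
                    out[m].add(k)
    return out


def is_navigable(d, out):
    """Every i needs an out-neighbor strictly closer to every target k != i."""
    n = len(d)
    return all(
        any(d[j][k] < d[i][k] for j in out[i])
        for i in range(n) for k in range(n) if i != k
    )
```

With `rank = ranks(permutation_matrix(d))`, a membership query for any set cover instance takes constant time, which is the kind of lookup the Figure 2 caption describes. For the clique graph, each group contributes roughly $n$ edges and there are about $\sqrt{n}$ groups, in line with the $O(n^{3/2})$ edge count stated in the Figure 3 caption.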

Theorems & Definitions (98)

  • Definition 1: Navigable Graph
  • Theorem 1
  • Claim 1.1
  • Theorem 2
  • Corollary 2.1
  • Definition 2: Distance Function
  • Claim 3.1
  • proof
  • Claim 3.2
  • proof
  • ...and 88 more