The kernel of graph indices for vector search

Mariano Tepper, Ted Willke

TL;DR

Traditional navigable graph indices for vector search rely on Euclidean geometry, so their formal navigability guarantees hold only in Euclidean space. This work introduces SVG, a kernel-based graph index derived from a kernelized NNLS/SVM formulation, and proves navigability guarantees in general metric and non-metric spaces. It further shows that popular indices such as HNSW and DiskANN can be viewed as SVG specializations, and it extends the framework with SVG-L0, which enforces a bounded out-degree via an $\ell_0$ constraint and subspace pursuit. Together, these results unify existing graph indices under a kernel-theoretic lens and offer a principled, scalable approach to graph construction with formal guarantees.

Abstract

The most popular graph indices for vector search use principles from computational geometry to build the graph. Hence, their formal graph navigability guarantees are only valid in Euclidean space. In this work, we show that machine learning can be used to build graph indices for vector search in metric and non-metric vector spaces (e.g., for inner product similarity). From this novel perspective, we introduce the Support Vector Graph (SVG), a new type of graph index that leverages kernel methods to establish the graph connectivity and that comes with formal navigability guarantees valid in metric and non-metric vector spaces. In addition, we interpret the most popular graph indices, including HNSW and DiskANN, as particular specializations of SVG and show that new navigable indices can be derived from the principles behind this specialization. Finally, we propose SVG-L0, which incorporates an $\ell_0$ sparsity constraint into the SVG kernel method to build graphs with a bounded out-degree. This yields a principled way of implementing this practical requirement, in contrast to the traditional heuristic of simply truncating the out-edges of each node. Additionally, we show that SVG-L0 has a self-tuning property that avoids the heuristic of using a set of candidates to find the out-edges of each node and that keeps its computational complexity in check.

Paper Structure

This paper contains 14 sections, 17 theorems, 55 equations, 21 figures, 1 table.

Key Result

Lemma 1

Let $G = ([1 \dots n], \mathcal{E})$ be a monotonic search network and let $s, t \in [1 \dots n]$. Then the greedy search algorithm (algo:greedy_search), with $\mathbf{x}_t$ as the query and $s$ as the entry point, finds a monotonic path from $s$ to $t$ in $G$.
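Lemma 1 can be illustrated with a minimal sketch of greedy search. The paper's algorithm (algo:greedy_search) is not reproduced on this page, so the code below assumes the standard formulation: starting from the entry point, move to the out-neighbor closest to the query and stop when no neighbor improves on the current node. The function name `greedy_search`, its parameters, the Euclidean distance default, and the toy graph are all illustrative assumptions, not the authors' implementation.

```python
import math


def greedy_search(vectors, out_edges, query, entry, dist=math.dist):
    """Greedy graph traversal (assumed standard form, not the paper's code).

    vectors:   sequence of points, indexed by node id
    out_edges: mapping node id -> iterable of out-neighbor ids
    query:     the query point
    entry:     node id where the search starts
    dist:      dissimilarity function; Euclidean distance by default
    Returns the list of visited nodes, from the entry point to the stop node.
    """
    current = entry
    current_d = dist(vectors[current], query)
    path = [current]
    while True:
        # Look for the out-neighbor that is strictly closer to the query.
        best, best_d = current, current_d
        for j in out_edges[current]:
            d = dist(vectors[j], query)
            if d < best_d:
                best, best_d = j, d
        if best == current:
            # No neighbor improves on the current node: local minimum reached.
            return path
        current, current_d = best, best_d
        path.append(current)


# Toy example (hypothetical data): four points on a line, linked as a chain.
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
out_edges = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

# Query with x_t for t = 3, entering at s = 0.
path = greedy_search(vectors, out_edges, vectors[3], 0)
print(path)  # [0, 1, 2, 3]
```

In this toy chain, the distance to the query decreases strictly at every hop (3, 2, 1, 0), so the returned path is monotonic in the sense used by the lemma. On a graph that is not a monotonic search network, the same routine can stop at a local minimum before reaching $t$.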

Figures (21)

  • Figure 1: Graph index construction
  • Figure 2: Conceptual depiction of the pruning strategy in Euclidean space to find the out-edges of node $i$ in the graph index $G = ([1 \dots n], \mathcal{E})$. Attractive (inward arrowheads) and repulsive (outward arrowheads) forces promote similarity with $i$ or diversity between candidates, respectively. Blue and red arrows depict favorable and less favorable forces, respectively. Here, $\{ \overrightarrow{ij}, \overrightarrow{ik} \} \subset \mathcal{E}$ but $\overrightarrow{il} \not\in \mathcal{E}$, as one can move from $i$ to $k$ and then from $k$ to $l$ using the greedy search in algo:greedy_search.
  • Figure 3: (Left) Conceptual representation of the SVM hyperplane and margins involved in SVG. Here, the vector $\mathbf{x}_i$ is connected to its support vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, for which $f_i (\mathbf{x}_1) = f_i (\mathbf{x}_2) = -1$; see eq:decision_function. (Right) Example of the SVM decision function values (the level sets $f_i (\mathbf{x}) = 1, 0, -1$ are marked in dotted red, black, and blue lines, respectively). We observe that the function $f_i$ adjusts its shape to the topology of its surrounding points (i.e., the area where $f_i (\mathbf{x}) > 0$ adapts to its surroundings).
  • Figure 4: The SVM decision boundaries (left), i.e., $f_i (\mathbf{x}) = 0$, for each point in a regular 2D grid; see eq:decision_function. The function $f(\mathbf{x}) = \max_i f_i (\mathbf{x})$ (center) induces a tessellation. As expected, the tessellation, found by running a watershed algorithm on $f(\mathbf{x})$, forms a regular grid (right).
  • Figure 5: With the RBF kernel, Problem prob:kernel_svm_separation connects the central node $i$ to a subset of its Delaunay neighbors $\mathcal{D}_i$. For most configurations, the SVG neighbors satisfy $\mathcal{N}_i \subset \mathcal{D}_i$ (left/right plots). However, $\mathcal{N}_i = \mathcal{D}_i$ in odd configurations, e.g., when the vectors in $\mathcal{D}_i$ are equi-spread and equidistant to $i$ (center).
  • ...and 16 more figures

Theorems & Definitions (22)

  • Definition 1: Monotonic Path (fu_fast_2019)
  • Definition 2: Monotonic Search Network (fu_fast_2019)
  • Lemma 1 (fu_fast_2019)
  • Definition 3
  • Lemma 2
  • Theorem 1
  • Theorem 2
  • Definition 4: Generalized Monotonic Path
  • Definition 5: Generalized Monotonic Search Network
  • Lemma 3
  • ...and 12 more