The kernel of graph indices for vector search
Mariano Tepper, Ted Willke
TL;DR
Traditional navigable graph indices for vector search are built on Euclidean geometry, so their navigability guarantees hold only in Euclidean space. This work introduces SVG, a kernel-based graph index derived from a kernelized nonnegative least squares (NNLS)/SVM formulation, and proves navigability guarantees for it in general metric and non-metric spaces. It further shows that popular indices such as HNSW and DiskANN can be viewed as specializations of SVG. Finally, it extends the framework with SVG-L0, which enforces a bounded out-degree through an $\ell_0$ constraint solved with subspace pursuit, giving a principled and scalable way to sparsify the graph. Together, these results unify existing graph indices under a kernel-theoretic lens and yield a practical graph construction method with formal guarantees.
Abstract
The most popular graph indices for vector search use principles from computational geometry to build the graph. Hence, their formal graph navigability guarantees are only valid in Euclidean space. In this work, we show that machine learning can be used to build graph indices for vector search in metric and non-metric vector spaces (e.g., for inner product similarity). From this novel perspective, we introduce the Support Vector Graph (SVG), a new type of graph index that leverages kernel methods to establish the graph connectivity and that comes with formal navigability guarantees valid in metric and non-metric vector spaces. In addition, we interpret the most popular graph indices, including HNSW and DiskANN, as particular specializations of SVG and show that new navigable indices can be derived from the principles behind this specialization. Finally, we propose SVG-L0, which incorporates an $\ell_0$ sparsity constraint into the SVG kernel method to build graphs with a bounded out-degree. This yields a principled way of implementing this practical requirement, in contrast to the traditional heuristic of simply truncating the out-edges of each node. Additionally, we show that SVG-L0 has a self-tuning property that keeps its computational complexity in check and removes the need for the usual heuristic of selecting each node's out-edges from a candidate set.
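To make the notions of navigability and out-edge truncation concrete, the following minimal Python sketch shows greedy search on a small hand-built directed graph under a pluggable similarity function. It is an illustration only and not the SVG or SVG-L0 construction: the toy vectors, the graph, and the names `greedy_search` and `truncate_out_edges` are all hypothetical, and the truncation rule (keep the out-neighbors most similar to the node itself) is an assumed reading of "simply truncating the out-edges".

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Similarity = Callable[[Vector, Vector], float]


def inner_product(a: Vector, b: Vector) -> float:
    """Inner product similarity (non-metric); larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def neg_sq_euclidean(a: Vector, b: Vector) -> float:
    """Negative squared Euclidean distance; larger means closer."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_search(graph: Dict[int, List[int]], vectors: List[Vector],
                  query: Vector, entry: int,
                  similarity: Similarity) -> Tuple[int, List[int]]:
    """Hop to the most similar out-neighbor until no neighbor improves."""
    current = entry
    current_sim = similarity(vectors[current], query)
    path = [current]
    while True:
        best, best_sim = current, current_sim
        for nb in graph[current]:
            s = similarity(vectors[nb], query)
            if s > best_sim:
                best, best_sim = nb, s
        if best == current:  # local optimum: no out-neighbor is better
            return current, path
        current, current_sim = best, best_sim
        path.append(current)


def truncate_out_edges(graph: Dict[int, List[int]], vectors: List[Vector],
                       max_degree: int,
                       similarity: Similarity) -> Dict[int, List[int]]:
    """Assumed truncation heuristic: keep the max_degree out-neighbors
    that are most similar to the node itself."""
    pruned = {}
    for node, nbs in graph.items():
        ranked = sorted(nbs, key=lambda j: -similarity(vectors[node], vectors[j]))
        pruned[node] = ranked[:max_degree]
    return pruned


# Hypothetical toy data: five 2-D points and a hand-built directed graph.
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 2.0)]
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}

# Euclidean-style search walks 0 -> 1 -> 2 -> 3.
print(greedy_search(graph, vectors, (2.1, 0.9), 0, neg_sq_euclidean))

# The same graph searched with inner product similarity ends at node 4.
print(greedy_search(graph, vectors, (1.0, 3.0), 0, inner_product))

# Truncating to one out-edge per node strands the search at the entry point.
pruned = truncate_out_edges(graph, vectors, 1, neg_sq_euclidean)
print(greedy_search(graph, vectors, (0.0, 2.1), 0, neg_sq_euclidean))
print(greedy_search(pruned, vectors, (0.0, 2.1), 0, neg_sq_euclidean))
```

In this toy example, truncation removes the edge from node 0 to node 4, so a query near node 4 can no longer leave the entry point. This is the kind of failure a principled degree bound is meant to avoid.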
