Graph-based Nearest Neighbors with Dynamic Updates via Random Walks

Nina Mishra, Yonatan Naamad, Tal Wagner, Lichen Zhang

TL;DR

The paper addresses the challenge of deleting points from graph-based ANN indexes like HNSW without sacrificing performance. It introduces SPatch, a deletion procedure grounded in a random-walk framework that preserves hitting-time statistics via a star-mesh transform and a sparsified clique over the deleted point’s neighborhood, rendered deterministically by top-edge selection. Through extensive mass-deletion experiments, SPatch demonstrates strong recall, fast deletions, low query latency, and reduced memory usage compared to existing approaches. The work also shows that a softmax-based random walk closely mirrors greedy search, validating the theoretical model and offering a new lens for analyzing and improving dynamic graph-based ANN systems.
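The patching step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's SPatch implementation: it assumes each edge from the deleted point to a neighbor carries a positive weight, uses the classical star-mesh formula (clique weight = product of the two star weights divided by their total), and keeps the `keep` heaviest clique edges. The function name `star_mesh_patch` and the parameter `keep` are hypothetical.

```python
import itertools


def star_mesh_patch(neighbor_weights, keep):
    """Replace the star around a deleted point with a sparsified clique.

    neighbor_weights: dict mapping neighbor id -> weight of the edge between
        the deleted point and that neighbor (assumed positive).
    keep: number of clique edges to retain (hypothetical parameter).
    Returns a list of ((u, v), weight) pairs, heaviest first.
    """
    total = sum(neighbor_weights.values())
    clique = {}
    # Classical star-mesh transform: w(u, v) = w(p, u) * w(p, v) / sum_x w(p, x).
    for u, v in itertools.combinations(sorted(neighbor_weights), 2):
        clique[(u, v)] = neighbor_weights[u] * neighbor_weights[v] / total
    # Deterministic sparsification: keep the top-weight edges, ties broken by id.
    ranked = sorted(clique.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:keep]


if __name__ == "__main__":
    # Deleted point with three neighbors; keep the two heaviest clique edges.
    print(star_mesh_patch({1: 1.0, 2: 2.0, 3: 3.0}, keep=2))
```

How the weights are derived from distances, and how many edges are kept per deletion, are choices made in the paper that this sketch does not attempt to reproduce.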

Abstract

Approximate nearest neighbor search (ANN) is a common way to retrieve relevant search results, especially now in the context of large language models and retrieval-augmented generation. One of the most widely used algorithms for ANN is based on constructing a multi-layer graph over the dataset, called the Hierarchical Navigable Small World (HNSW). While this algorithm supports insertion of new data, it does not support deletion of existing data. Moreover, deletion algorithms described by prior work come at the cost of increased query latency, decreased recall, or prolonged deletion time. In this paper, we propose a new theoretical framework for graph-based ANN based on random walks. We then utilize this framework to analyze a randomized deletion approach that preserves hitting time statistics compared to the graph before deleting the point. We then turn this theoretical framework into a deterministic deletion algorithm, and show that it provides a better tradeoff between query latency, recall, deletion time, and memory usage through an extensive collection of experiments.

Paper Structure

This paper contains 27 sections, 18 theorems, 38 equations, 10 figures, 3 tables, 5 algorithms.

Key Result

Theorem 3.1

Let $P\subset \mathbb{R}^d$ be an $n$-point dataset preprocessed by an HNSW and $p\in P$ be a point to-be-deleted that is not the entry point. Fix a query point $q\in \mathbb{R}^d$ and suppose the search reaches layer $l\in \{1,\ldots,L\}$, let $N(p)$ denote the neighborhood of $p$ at layer $l$. Sup

Figures (10)

  • Figure 1: The deletion procedure of Algorithm \ref{alg:sparsify}. It first forms a clique over the neighborhood of a deleted point, and then sparsifies the clique according to edge weights.
  • Figure 2: The rows are MPNet, SIFT, GIST and MiniLM; the columns are top-10 recall, number of distance computations per query, total deletion time, and number of edges in the bottom layer of the graph. Legends: spatch -- our algorithm SPatch, fresh -- FreshDiskANN, tomb -- tombstone, nopatch -- no patching, local -- local reconnect. For MPNet, we also include rebuild, without plotting its deletion time.
  • Figure 3: We construct the graph by first adding edges between $Q$ and all vertices, then performing a random walk to determine the candidate edges to keep, and finally sparsifying them by sampling.
  • Figure 4: The impact of varying $\widehat{r}$ (i.e. $r\mu$) on transition probabilities and recall. Left: The frequency with which the random softmax algorithm truly transitions to the nearest neighbor (i.e. takes a greedy step), as a function of $\widehat{r}$. Right: The impact of different choices of $\widehat{r}$ on the recall of the randomized search algorithm. The horizontal line indicates the recall of the greedy search algorithm.
  • Figure 5: Top left: SIFT, top right: GIST, bottom left: MPNet, bottom right: MiniLM.
  • ...and 5 more figures
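
The softmax random walk that Figure 4 compares against greedy search can be illustrated with a small sketch. The exact transition rule is not given in this summary, so the form below is an assumption: the walk moves to a neighbor with probability proportional to $\exp(-r \cdot d)$, where $d$ is that neighbor's distance to the query and `r` is a hypothetical sharpness parameter. Under this assumption, larger `r` concentrates the probability on the nearest neighbor, which is the greedy step.

```python
import math
import random


def softmax_transition_probs(distances, r):
    """Transition probabilities over neighbors, assuming weight exp(-r * distance).

    distances: distances from each candidate neighbor to the query.
    r: sharpness parameter (hypothetical name); r = 0 gives a uniform walk.
    """
    closest = min(distances)
    # Subtracting the minimum distance keeps the exponentials numerically stable.
    weights = [math.exp(-r * (d - closest)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_step(distances, r, rng):
    """Sample the index of the next neighbor to visit."""
    probs = softmax_transition_probs(distances, r)
    return rng.choices(range(len(distances)), weights=probs, k=1)[0]


if __name__ == "__main__":
    dists = [1.0, 2.0, 3.0]
    for r in (0.0, 1.0, 50.0):
        print(r, softmax_transition_probs(dists, r))
    print(softmax_step(dists, 1000.0, random.Random(0)))
```

With `r = 0` every neighbor is equally likely, while for large `r` the step is effectively deterministic and matches the greedy choice.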

Theorems & Definitions (30)

  • Theorem 3.1
  • Theorem 4.1
  • Definition 4.2
  • Theorem 4.3
  • Definition 4.4
  • Theorem 4.5: Informal version of Theorem \ref{thm:hitting_time_formal}
  • Corollary 4.6
  • Lemma A.1: Markov's inequality
  • Lemma A.2: Weyl's inequality, w12
  • Lemma A.3: Theorem 4.1 of w73
  • ...and 20 more