Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph

Cong Fu, Chao Xiang, Changxu Wang, Deng Cai

TL;DR

This work tackles scalable approximate nearest neighbor search (ANNS) for billion-point datasets by introducing a theoretical graph family (MSNET) and two graph structures: MRNG (Monotonic Relative Neighborhood Graph) and NSG (Navigating Spreading-out Graph). MRNG provides near-logarithmic search complexity via monotone paths, and NSG offers a practical approximation that preserves connectivity while keeping the average degree small to enable fast traversal. The authors prove key monotonicity properties, analyze indexing costs, and demonstrate through extensive experiments on public datasets, large-scale synthetic data, a billion-scale Taobao deployment, and the DEEP1B subset that NSG achieves superior high-precision performance with significantly smaller memory footprints. The results establish NSG as a competitive, scalable graph-based ANNS approach suitable for industrial-scale search systems.

Abstract

Approximate nearest neighbor search (ANNS) is a fundamental problem in databases and data mining. A scalable ANNS algorithm should be both memory-efficient and fast. Some early graph-based approaches have shown attractive theoretical guarantees on search time complexity, but they all suffer from high indexing time complexity. Recently, some graph-based methods have been proposed to reduce indexing complexity by approximating the traditional graphs; these methods have achieved revolutionary performance on million-scale datasets. Yet they still cannot scale to billion-node databases. In this paper, to further improve the search efficiency and scalability of graph-based methods, we start by introducing four aspects: (1) ensuring the connectivity of the graph; (2) lowering the average out-degree of the graph for fast traversal; (3) shortening the search path; and (4) reducing the index size. Then, we propose a novel graph structure called the Monotonic Relative Neighborhood Graph (MRNG), which guarantees very low search complexity (close to logarithmic time). To further lower the indexing complexity and make it practical for billion-node ANNS problems, we propose a novel graph structure named the Navigating Spreading-out Graph (NSG) by approximating the MRNG. The NSG takes the four aspects into account simultaneously. Extensive experiments show that NSG significantly outperforms all existing algorithms. In addition, NSG shows superior performance in the e-commerce search scenario of Taobao (Alibaba Group) and has been integrated into their search engine at billion-node scale.

Paper Structure

This paper contains 36 sections, 5 theorems, 1 equation, 12 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Given a finite point set $S$ of $n$ points, randomly distributed in space $E^d$, and a monotonic search network $G$ constructed on $S$, a monotonic path between any two nodes $p,q$ in $G$ can be found by Algorithm 1 without backtracking.
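The behaviour Theorem 1 describes can be illustrated with a small sketch. The paper's search algorithm is not reproduced in this excerpt, so the code below is a simplified stand-in: a single-path greedy walk that always moves to the out-neighbor closest to the target and stops when no neighbor is closer. The names `greedy_walk` and `is_monotonic`, and the toy graph, are hypothetical and chosen only for illustration.

```python
import math


def greedy_walk(points, graph, start, query):
    """Greedy walk without backtracking (simplified, assumed variant).

    points: list of coordinate tuples; graph: dict node -> list of out-neighbors.
    At each step, move to the out-neighbor closest to `query`; stop when no
    out-neighbor is strictly closer than the current node.
    """
    path = [start]
    current = start
    while True:
        best = min(
            graph[current],
            key=lambda v: math.dist(points[v], query),
            default=current,
        )
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return path
        current = best
        path.append(current)


def is_monotonic(points, path, target):
    """True if every step of `path` gets strictly closer to `target`."""
    d = [math.dist(points[v], target) for v in path]
    return all(a > b for a, b in zip(d, d[1:]))


# Toy example: four collinear points linked as a chain.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
path = greedy_walk(points, graph, start=0, query=points[3])
print(path, is_monotonic(points, path, points[3]))
```

On a graph that is not a monotonic search network, the same walk can stop at a local minimum before reaching the target, which is the failure the theorem rules out.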

Figures (12)

  • Figure 1: (a) is the tree index, (b) is the hashing index, and (c) is the graph index. The red star is the query (not included in the base data). The four red rings are its nearest neighbors. The tree and hashing indices partition the space into several cells, each containing no more than three points. The out-degree of each node in the graph index is likewise no more than three. To retrieve the nearest neighbors of the query with the tree index, we need to backtrack and check many leaf nodes. With the hashing index, we need to check nearby buckets within Hamming radius 2. With the graph index, Algorithm 1 forms a search path as the red lines show. The orange circles are the points checked during the search. The graph-based algorithm requires the fewest distance calculations.
  • Figure 2: An illustration of the search in an MSNET. The query point is $q$ and the search starts at point $p$. At each step, Algorithm 1 selects the node closest to $q$ among the neighbors of the current node. Suppose $p,r,s$ are on a monotonic path selected by Algorithm 1. The search region shrinks from sphere $B(q,\delta(p,q))$ to $B(q,\delta(r,q))$, then to $B(q,\delta(s,q))$. The number of nodes in each sphere (those that may be checked) decreases by some ratio at each step until only $q$ is left in the final sphere.
  • Figure 3: A comparison between the edge selection strategy of the RNG (a) and the MRNG (b). An RNG is an undirected graph, while an MRNG is a directed one. In (a), $p$ and $r$ are linked because there is no node in $lune_{pr}$. Because $r \in lune_{ps}$, $s \in lune_{pt}$, $t \in lune_{pu}$, and $u \in lune_{pq}$, there are no edges between $p$ and $s,t,u,q$. In (b), $p$ and $r$ are linked because there is no node in $lune_{pr}$. $p$ and $s$ are not linked because $r \in lune_{ps}$ and $pr, sr \in$ MRNG. Directed edge $\overset{\longrightarrow}{pt} \in$ MRNG because $\overset{\longrightarrow}{ps} \notin$ MRNG. However, $\overset{\longrightarrow}{tp} \notin$ MRNG because $\overset{\longrightarrow}{ts} \in$ MRNG. We can see that the MRNG is defined recursively, and the edge selection strategy of the RNG is stricter than the MRNG's. In the RNG (a), there is a monotonic path from $q$ to $p$, but no monotonic path from $p$ to $q$. In the MRNG (b), there is at least one monotonic path from any node to any other node.
  • Figure 4: An illustration of why NNG $\subset$ MRNG is necessary. If it does not hold, the graph cannot be an MSNET. Path $p,q,r,s,t$ is an example of a non-monotonic path from $p$ to $t$. In this graph, $t$ is the nearest neighbor of $q$ but is not linked to $q$. We apply the MRNG's edge selection strategy to this graph. According to the definition of the strategy, $t$ and $r$ can never be linked. When the search goes from $p$ to $t$, it must detour by at least one more step through $s$. This problem will be worse in practice.
  • Figure 5: An illustration of the candidates of edge selection in NSG. Node $p$ is the node to be processed, and $m$ is the Navigating Node. The red nodes are the $k$ nearest neighbors of node $p$. The big black nodes and the solid lines form a possible monotonic path from $m$ to $p$, generated by the search-and-collect routine. The small black nodes are the nodes visited by the search-and-collect routine. All the nodes in the figure will be added to the candidate set of $p$.
  • ...and 7 more figures
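The edge selection strategy described in the Figure 3 caption can be sketched in code. This is a brute-force illustration built from the caption alone, since the formal Definition 5 is not reproduced in this excerpt. It assumes that $lune_{pq}$ is the set of points closer than $\delta(p,q)$ to both $p$ and $q$ (the usual RNG convention), and that a directed edge from $p$ to $q$ is kept unless some already-selected out-neighbor $r$ of $p$ lies in $lune_{pq}$. The function names are hypothetical.

```python
import math


def in_lune(points, p, q, r):
    """True if r is strictly closer than delta(p, q) to both p and q."""
    d_pq = math.dist(points[p], points[q])
    return (math.dist(points[p], points[r]) < d_pq
            and math.dist(points[q], points[r]) < d_pq)


def select_out_edges(points, p):
    """Choose out-neighbors of p, scanning candidates from nearest to farthest.

    A candidate q is rejected when an already-selected neighbor r falls in
    lune_pq; any such r is closer to p than q, so it was examined earlier.
    """
    candidates = sorted(
        (v for v in range(len(points)) if v != p),
        key=lambda v: math.dist(points[p], points[v]),
    )
    selected = []
    for q in candidates:
        if not any(in_lune(points, p, q, r) for r in selected):
            selected.append(q)
    return selected


def build_graph(points):
    """Directed graph: node -> selected out-neighbors (brute force over all nodes)."""
    return {p: select_out_edges(points, p) for p in range(len(points))}


points = [(0, 0), (1, 0), (2, 0.5), (0, 3)]
print(select_out_edges(points, 0))
```

In this example node 2 is rejected as an out-neighbor of node 0 because the selected node 1 lies in their lune, while node 3 is kept because node 1 is farther from node 3 than node 0 is. The nearest neighbor of each node is always selected first, consistent with the requirement illustrated in Figure 4. The scan considers all other points as candidates for every node, which is only workable for small inputs.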

Theorems & Definitions (11)

  • Definition 1: Nearest Neighbor Search
  • Definition 2: $\epsilon$-Nearest Neighbor Search
  • Definition 3: Monotonic Path
  • Definition 4: Monotonic Search Network
  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Definition 5: MRNG
  • Theorem 3
  • Definition 6: NNG
  • ...and 1 more