Fast and exact fixed-radius neighbor search based on sorting

Xinye Chen, Stefan Güttel

TL;DR

The paper addresses exact fixed-radius near neighbor search with SNN, a sorting-based method that returns exact results and has no tuning parameters; the search radius is the only user input. SNN centers the data, projects it onto the first principal component, and sorts the points by the resulting score. At query time, only points whose score lies within the search radius of the query's score remain candidates, and the remaining distance checks use a matrix-based formulation built on BLAS routines. The theoretical analysis relates pruning efficiency to the geometry of the data through its singular values. Experiments on synthetic and real-world datasets report substantial speedups over brute force and tree-based methods, both stand-alone and within DBSCAN clustering, with potential for online and GPU-accelerated deployments.

Abstract

Fixed-radius near neighbor search is a fundamental data operation that retrieves all data points within a user-specified distance to a query point. There are efficient algorithms that can provide fast approximate query responses, but they often have a very compute-intensive indexing phase and require careful parameter tuning. Therefore, exact brute force and tree-based search methods are still widely used. Here we propose a new fixed-radius near neighbor search method, called SNN, that significantly improves over brute force and tree-based methods in terms of index and query time, provably returns exact results, and requires no parameter tuning. SNN exploits a sorting of the data points by their first principal component to prune the query search space. Further speedup is gained from an efficient implementation using high-level Basic Linear Algebra Subprograms (BLAS). We provide theoretical analysis of our method and demonstrate its practical performance when used stand-alone and when applied within the DBSCAN clustering algorithm.
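
The indexing and query steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_index` and `query_radius` and the dictionary layout are hypothetical, the Euclidean distance is assumed, and the matrix-vector product stands in for the BLAS-based distance computation the abstract mentions.

```python
import numpy as np

def build_index(X):
    """Center the data, project onto the first principal direction, and sort."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # The first right singular vector of the centered data is the
    # first principal direction (unit norm).
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v1 = vt[0]
    scores = Xc @ v1
    order = np.argsort(scores)          # sort points by their score
    Xs = Xc[order]
    return {
        "mu": mu,
        "v1": v1,
        "order": order,                 # maps sorted position -> original index
        "X": Xs,
        "scores": scores[order],
        "sqnorm": np.einsum("ij,ij->i", Xs, Xs),  # precomputed squared norms
    }

def query_radius(index, q, radius):
    """Return original indices of all points within `radius` of `q`."""
    qc = q - index["mu"]
    sq = qc @ index["v1"]
    # Candidates are the contiguous block of sorted points whose score
    # lies in [sq - radius, sq + radius]; two binary searches find it.
    lo = np.searchsorted(index["scores"], sq - radius, side="left")
    hi = np.searchsorted(index["scores"], sq + radius, side="right")
    cand = index["X"][lo:hi]
    # Squared distances via ||x - q||^2 = ||x||^2 - 2 x.q + ||q||^2;
    # the product `cand @ qc` is dispatched to BLAS by NumPy.
    d2 = index["sqnorm"][lo:hi] - 2.0 * (cand @ qc) + qc @ qc
    keep = d2 <= radius ** 2
    return np.sort(index["order"][lo:hi][keep])
```

A brute-force scan over all points should return the same index set for the same data, query, and radius; the sketch only reduces how many distances are evaluated.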

Paper Structure (13 sections, 26 equations, 3 figures, 7 tables, 2 algorithms)

Figures (3)

  • Figure 1: Query with radius $R$. The data points in the shaded band have a first principal coordinate within distance $R$ of the first principal coordinate of the query point, and hence are near neighbor (NN) candidates. Because all data points are sorted by this coordinate, the candidates occupy a contiguous range of indices.
  • Figure 2: Comparing SNN to brute force search and tree-based methods. Total index time (top) and average query time (bottom) for the synthetic uniformly distributed dataset, all in seconds, as the data size $n$ is varied (left) or the dimension $d$ is varied (right). Brute force query methods do not require an index construction, hence are omitted on the left. Our SNN method is the best performer in all cases, in some cases 10 times faster than the best tree-based method (balltree).
  • Figure 3: Comparing GriSPy and SNN. Total index time (top) and average query time (bottom) on uniformly distributed data, all in seconds, as the data size $n$ is varied (left) or the dimension $d$ is varied (right). Our SNN method significantly outperforms GriSPy both in terms of indexing and query runtime.
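
The shaded band in Figure 1 loses no true neighbors, which a short Cauchy-Schwarz argument makes explicit. The notation here is an assumption for illustration: $v_1$ denotes the unit-norm first principal direction, $\mu$ the data mean, $x$ a data point, $q$ the query, and the distance is taken to be Euclidean.

```latex
\begin{align*}
\bigl| v_1^{T}(x-\mu) - v_1^{T}(q-\mu) \bigr|
  &= \bigl| v_1^{T}(x-q) \bigr| \\
  &\le \|v_1\|_2 \, \|x-q\|_2 \\
  &= \|x-q\|_2 .
\end{align*}
% Hence \|x-q\|_2 \le R implies that the first principal coordinates of
% x and q differ by at most R. Equivalently, any point outside the band
% is farther than R from q and can be discarded without a distance check.
```

The band is therefore a necessary condition only: points inside it still need an exact distance check, which is why the caption calls them candidates.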