Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor Search

Jiuqi Wei, Xiaodong Lee, Zhenyu Liao, Themis Palpanas, Botao Peng

TL;DR

The paper tackles high-dimensional approximate nearest neighbor (ANN) search by introducing Subspace Collision (SC), a framework that uses multi-subspace sampling and an SC-score, which follows the Pareto principle, as a proxy for Euclidean distance. Building on SC, the authors present SuCo, a lightweight indexing and querying method that uses clustering-based subspace indexes and an inverted multi-index, coupled with a Dynamic Activation algorithm, to count collisions efficiently and retrieve high-quality candidates. The paper provides theoretical guarantees for the effectiveness of the SC-score and for the quality of k-ANN results. Extensive experiments show that, compared with state-of-the-art ANN methods that offer theoretical guarantees, SuCo answers queries 1–2 orders of magnitude faster with up to one-tenth of the index memory; it also achieves top performance (best on hard datasets) when compared with methods that offer no guarantees. The result is a practical, scalable approach to ANN search with guarantees in high dimensions, accompanied by parameter guidance for real-world deployment.

Abstract

Approximate Nearest Neighbor (ANN) search in high-dimensional Euclidean spaces is a fundamental problem with a wide range of applications. However, there is currently no ANN method that performs well in both indexing and query answering while also providing rigorous theoretical guarantees for the quality of the answers. In this paper, we first design SC-score, a metric that we show follows the Pareto principle and can act as a proxy for the Euclidean distance between data points. Inspired by this, we propose a novel ANN search framework called Subspace Collision (SC), which can provide theoretical guarantees on the quality of its results. We further propose SuCo, which achieves efficient and accurate ANN search by designing a clustering-based lightweight index and query strategies for our proposed subspace collision framework. Extensive experiments on real-world datasets demonstrate that SuCo outperforms state-of-the-art ANN methods that can provide theoretical guarantees in both indexing and query answering, answering queries 1-2 orders of magnitude faster with only up to one-tenth of the index memory footprint. Moreover, SuCo achieves top performance (best for hard datasets) even when compared to methods that do not provide theoretical guarantees. This paper was published in SIGMOD 2025.

Paper Structure

This paper contains 31 sections, 2 theorems, 14 figures, 6 tables, and 4 algorithms.

Key Result

Theorem 1

Given a query point $q \in \mathbb{R}^d$, two independent random data points $o_1, o_2 \in$ dataset $\mathcal{D}$, and the SC-score of $o_1$ is greater than that of $o_2$, then $\left\|o_1, q\right\| < \left\|o_2, q\right\|$ holds with probability at least $1/2 - 1/e^2$ for appropriate choices of th

Figures (14)

  • Figure 1: Illustration of finding nearest neighbors using the idea of subspace and collision.
  • Figure 2: "Pareto principle" of SC-score on four datasets. Each plot contains $n=10$M $(10^7)$ scatter points; a point $(i,j)$ means that the average SC-score of the $i$-th NN over 1000 queries is $j$, where $i=1,2,\ldots,10^7$ and $j \in \left[0, N_s\right]$.
  • Figure 3: Overview of the SuCo workflow.
  • Figure 4: Illustration of K-means clustering ($K$=16) in 2D space using inverted index or inverted multi-index.
  • Figure 5: An illustration of the Dynamic Activation algorithm.
  • ...and 9 more figures

Theorems & Definitions (6)

  • Definition 1: Collision
  • Definition 2: Subspace Collision
  • Definition 3: Subspace Sampling
  • Definition 4: SC-score
  • Theorem 1: Effectiveness of SC-score
  • Theorem 2: Quality Guarantee of ANN Search
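
The definitions listed above are only named on this page, so the following is a rough brute-force sketch of how an SC-score could be computed and used, not the authors' implementation. It assumes that the dimensions are split into $N_s$ contiguous subspaces, that a point "collides" with the query in a subspace when it is among the `alpha * n` closest points there, and that the SC-score counts the subspaces in which a collision occurs (consistent with $j \in [0, N_s]$ in Figure 2). The names `split_dimensions`, `sc_scores`, `sc_search`, and the parameters `alpha` and `beta` are hypothetical. SuCo replaces the exhaustive per-subspace scan shown here with clustering-based indexes and Dynamic Activation.

```python
import heapq


def split_dimensions(d, n_subspaces):
    """Split range(d) into contiguous, near-equal groups (an assumed sampling scheme)."""
    base, extra = divmod(d, n_subspaces)
    groups, start = [], 0
    for s in range(n_subspaces):
        size = base + (1 if s < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def sq_dist(a, b, dims):
    """Squared Euclidean distance between a and b restricted to the given dimensions."""
    return sum((a[j] - b[j]) ** 2 for j in dims)


def sc_scores(data, query, n_subspaces, alpha):
    """Count, for every point, the subspaces in which it collides with the query."""
    n, d = len(data), len(query)
    # Assumption: a collision means being among the alpha * n closest points.
    n_collide = min(n, max(1, int(alpha * n)))
    scores = [0] * n
    for dims in split_dimensions(d, n_subspaces):
        nearest = heapq.nsmallest(
            n_collide, range(n), key=lambda i: sq_dist(data[i], query, dims)
        )
        for i in nearest:
            scores[i] += 1
    return scores


def sc_search(data, query, k, n_subspaces=8, alpha=0.05, beta=0.1):
    """Keep the beta * n points with the highest SC-score, then re-rank by true distance."""
    n = len(data)
    scores = sc_scores(data, query, n_subspaces, alpha)
    n_candidates = min(n, max(k, int(beta * n)))
    candidates = heapq.nlargest(n_candidates, range(n), key=lambda i: scores[i])
    all_dims = range(len(query))
    return heapq.nsmallest(
        k, candidates, key=lambda i: sq_dist(data[i], query, all_dims)
    )
```

Calling `sc_search(data, query, k=10)` on a list of equal-length float vectors returns the indices of the `k` re-ranked candidates. The sketch scans every point in every subspace, so it illustrates the scoring idea only and has none of the efficiency that the paper attributes to SuCo.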