Subspace Collision: An Efficient and Accurate Framework for High-dimensional Approximate Nearest Neighbor Search
Jiuqi Wei, Xiaodong Lee, Zhenyu Liao, Themis Palpanas, Botao Peng
TL;DR
The paper tackles high-dimensional approximate nearest neighbor (ANN) search by introducing Subspace Collision (SC), a framework built on multi-subspace sampling and an SC-score, a metric that follows the Pareto principle and serves as a proxy for Euclidean distance. Building on SC, the authors present SuCo, a lightweight indexing and querying system that combines clustering-based subspace indexes with an inverted multi-index and uses Dynamic Activation to count collisions efficiently and retrieve high-quality candidates. The paper provides theoretical guarantees for the effectiveness of the SC-score and for k-ANN accuracy. Extensive experiments show that, compared with state-of-the-art ANN methods that offer theoretical guarantees, SuCo answers queries 1–2 orders of magnitude faster while using up to one-tenth of the index memory. It also achieves top performance against methods without guarantees, and the best performance on hard datasets. The work demonstrates a practical, scalable approach to guaranteed ANN search in high dimensions, with parameter guidance for real-world deployment.
Abstract
Approximate Nearest Neighbor (ANN) search in high-dimensional Euclidean spaces is a fundamental problem with a wide range of applications. However, there is currently no ANN method that performs well in both indexing and query answering while also providing rigorous theoretical guarantees for the quality of the answers. In this paper, we first design SC-score, a metric that we show follows the Pareto principle and can act as a proxy for the Euclidean distance between data points. Inspired by this, we propose a novel ANN search framework called Subspace Collision (SC), which can provide theoretical guarantees on the quality of its results. We further propose SuCo, which achieves efficient and accurate ANN search by designing a clustering-based lightweight index and query strategies for our proposed subspace collision framework. Extensive experiments on real-world datasets demonstrate that SuCo outperforms state-of-the-art ANN methods that provide theoretical guarantees in both indexing and query answering, answering queries 1-2 orders of magnitude faster with only up to one-tenth of the index memory footprint. Moreover, SuCo achieves top performance (best for hard datasets) even when compared to methods that do not provide theoretical guarantees. This paper was published in SIGMOD 2025.
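
To make the idea of counting subspace collisions concrete, the following is a minimal brute-force sketch in Python, not the authors' implementation. It assumes that the dimensions are split into contiguous subspaces, that a point "collides" with the query in a subspace when it is among the closest `alpha` fraction of points there, and that the `beta` fraction of points with the highest collision counts is re-ranked by full Euclidean distance. The function names and the parameters `num_subspaces`, `alpha`, and `beta` are illustrative assumptions. SuCo's clustering-based index, inverted multi-index, and Dynamic Activation are deliberately omitted, so this sketch scans all points in every subspace.

```python
import heapq
import random


def sq_dist(a, b, dims):
    """Squared Euclidean distance restricted to the given dimensions."""
    return sum((a[d] - b[d]) ** 2 for d in dims)


def sc_scores(data, query, num_subspaces, alpha):
    """For each point, count the subspaces in which it collides with the query.

    Assumption: a collision means being among the alpha*n closest points
    to the query within that subspace.
    """
    n, dim = len(data), len(query)
    # Assumption: contiguous, roughly equal-sized split of the dimensions.
    bounds = [round(i * dim / num_subspaces) for i in range(num_subspaces + 1)]
    n_collide = max(1, int(alpha * n))
    scores = [0] * n
    for s in range(num_subspaces):
        dims = range(bounds[s], bounds[s + 1])
        nearest = heapq.nsmallest(
            n_collide, range(n), key=lambda i: sq_dist(data[i], query, dims)
        )
        for i in nearest:
            scores[i] += 1
    return scores


def sc_knn(data, query, k, num_subspaces=4, alpha=0.1, beta=0.05):
    """Approximate k-NN: select candidates by collision count, then re-rank."""
    scores = sc_scores(data, query, num_subspaces, alpha)
    n_cand = max(k, int(beta * len(data)))
    # Points with the highest collision counts become the candidate set.
    candidates = heapq.nlargest(n_cand, range(len(data)), key=lambda i: scores[i])
    # Re-rank the candidates by distance over all dimensions.
    all_dims = range(len(query))
    return heapq.nsmallest(
        k, candidates, key=lambda i: sq_dist(data[i], query, all_dims)
    )


# Toy usage on synthetic data: 200 random points in 16 dimensions.
random.seed(0)
data = [[random.gauss(0, 1) for _ in range(16)] for _ in range(200)]
query = data[7]  # use an existing point as the query
top5 = sc_knn(data, query, k=5)
print(top5)  # list of 5 point indices; index 7 comes first
```

Because the query here is itself a data point, it collides in every subspace and is returned first. In practice the quality of the remaining answers depends on the choice of `alpha` and `beta`, which this sketch does not tune.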
