TaCo: Data-adaptive and Query-aware Subspace Collision for High-dimensional Approximate Nearest Neighbor Search

Jiuqi Wei, Zhenyu Liao, Ruoyu Han, Quanqing Xu, Chuanhui Yang, Themis Palpanas

Abstract

Approximate Nearest Neighbor Search (ANNS) in high-dimensional Euclidean spaces is a fundamental problem with broad applications. Subspace Collision is a recently proposed ANNS framework that offers a new paradigm for similarity search and achieves strong indexing and query performance. However, the subspace collision framework remains data-agnostic and query-oblivious, resulting in imbalanced index construction and wasted query overhead. In this paper, we address these limitations from two aspects: first, we design a subspace-oriented data transformation mechanism by averaging the entropies computed over each subspace of the transformed data, which ensures balanced subspace partitioning (in an information-theoretic sense) and enables data-adaptive subspace collision; second, we present query-aware and scalable query strategies that dynamically allocate overhead for each query and accelerate collision probing within subspaces. Building on these ideas, we propose a novel data-adaptive and query-aware subspace collision method, abbreviated as TaCo, which achieves efficient and accurate ANN search while maintaining an excellent balance between indexing and query performance. Extensive experiments on real-world datasets demonstrate that, compared to state-of-the-art subspace collision methods, TaCo achieves up to 8x speedup in indexing and reduces the memory footprint to 0.6x, while achieving over 1.5x query throughput. Moreover, TaCo achieves state-of-the-art indexing performance and provides an effective balance between indexing and query efficiency, even when compared with advanced methods beyond the subspace-collision paradigm. This paper was published in SIGMOD 2026.

Paper Structure

This paper contains 33 sections, 4 theorems, 8 equations, 12 figures, 3 tables, 6 algorithms.

Key Result

Theorem 1

Assume that the data sample covariance $\hat{\Sigma}$ has distinct eigenvalues that are all greater than or equal to one. Then the Eigensystem Allocation method (Algorithm eigensystem_allocation) solves the optimization problem in Equation (eq:opt_1).
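
Neither the Eigensystem Allocation algorithm nor the optimization problem it solves is reproduced in this summary, so the following is only a minimal sketch of what balancing entropy across subspaces can look like. It assumes Gaussian data, for which the differential entropy of a subspace spanned by eigenvectors with eigenvalues $\lambda_i$ is $\frac{1}{2}\sum_i \log(2\pi e \lambda_i)$, and it uses a greedy rule picked purely for illustration. The names `gaussian_entropy` and `greedy_allocate` are hypothetical and should not be read as the paper's method.

```python
import math


def gaussian_entropy(eigenvalues):
    """Differential entropy (in nats) of a Gaussian with these covariance eigenvalues."""
    return 0.5 * sum(math.log(2 * math.pi * math.e * lam) for lam in eigenvalues)


def greedy_allocate(eigenvalues, num_subspaces):
    """Assign eigenvalue indices to equal-size subspaces, balancing sum(log(lambda)).

    Hypothetical heuristic (an assumption, not the paper's algorithm): visit
    eigenvalues from largest to smallest and give each one to the subspace
    that still has room and currently has the smallest running log-sum.
    """
    d = len(eigenvalues)
    if d % num_subspaces != 0:
        raise ValueError("number of eigenvalues must be divisible by num_subspaces")
    capacity = d // num_subspaces
    groups = [[] for _ in range(num_subspaces)]
    log_sums = [0.0] * num_subspaces
    order = sorted(range(d), key=lambda i: eigenvalues[i], reverse=True)
    for i in order:
        # Only subspaces that are not yet full may receive another eigenvector.
        open_groups = [g for g in range(num_subspaces) if len(groups[g]) < capacity]
        target = min(open_groups, key=lambda g: log_sums[g])
        groups[target].append(i)
        log_sums[target] += math.log(eigenvalues[i])
    return groups


# Toy spectrum; all eigenvalues are >= 1, matching the assumption of Theorem 1.
eigenvalues = [16.0, 8.0, 4.0, 2.0]
groups = greedy_allocate(eigenvalues, num_subspaces=2)
entropies = [gaussian_entropy([eigenvalues[i] for i in g]) for g in groups]
```

On this toy spectrum the two subspaces end up with the same eigenvalue product, and therefore the same Gaussian entropy. With eigenvalues of at least one, every $\log \lambda_i$ term is non-negative, so the running log-sums never decrease as eigenvalues are assigned.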

Figures (12)

  • Figure 1: "Pareto principle" of SC-score.
  • Figure 2: Illustration of the proposed subspace-oriented entropy averaging approach, with random data vector $o$ of mean zero and covariance $\Sigma$ represented by a 3D ellipsoid.
  • Figure 3: SC-score of the transformed data still follows the Pareto principle, as does the original data in Figure 1.
  • Figure 4: Overview of the TaCo workflow.
  • Figure 5: Efficiency comparison between Scalable Dynamic Activation and Dynamic Activation algorithms on SIFT10M.
  • ...and 7 more figures

Theorems & Definitions (11)

  • Definition 1: Nearest Neighbor Search, NNS
  • Definition 2: $k$-Nearest Neighbor Search, $k$-NNS
  • Definition 3: $k$-Approximate Nearest Neighbor Search, $k$-ANNS
  • Definition 4: Subspace Sampling
  • Definition 5: Subspace Collision
  • Definition 6: SC-score
  • Theorem 1: Performance Guarantee for Algorithm eigensystem_allocation
  • Lemma 1: Local Distance Preservation for Algorithm eigensystem_allocation
  • Theorem 2: Relative Neighborhood Ordering Preservation for Algorithm data_transformation
  • Definition 7: Query Discriminability
  • ...and 1 more
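
Definitions 4 to 6 name subspace sampling, subspace collision, and the SC-score, but their statements are not reproduced in this summary. The sketch below shows one plausible reading, based on the usual formulation in earlier subspace collision work. It makes three assumptions: the dimensions are split into equal contiguous subspaces, a point collides with the query in a subspace when it is among the closest `collision_ratio` fraction of points there, and the SC-score counts the subspaces in which a collision occurs. The function name `sc_scores` and its parameters are hypothetical.

```python
import math


def sc_scores(data, query, num_subspaces, collision_ratio):
    """Count, for each data point, the subspaces in which it collides with the query.

    Assumption: a point collides in a subspace when it is among the
    ceil(collision_ratio * n) points nearest to the query in that subspace.
    """
    n, d = len(data), len(query)
    if d % num_subspaces != 0:
        raise ValueError("dimension must be divisible by num_subspaces")
    width = d // num_subspaces
    top = max(1, math.ceil(collision_ratio * n))
    scores = [0] * n
    for s in range(num_subspaces):
        lo, hi = s * width, (s + 1) * width
        # Squared Euclidean distance restricted to dimensions lo .. hi - 1.
        dists = [sum((p[j] - query[j]) ** 2 for j in range(lo, hi)) for p in data]
        # Indices of the `top` nearest points in this subspace.
        nearest = sorted(range(n), key=lambda i: dists[i])[:top]
        for i in nearest:
            scores[i] += 1
    return scores


data = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 5.0, 5.0],
    [5.0, 5.0, 1.0, 1.0],
    [9.0, 9.0, 9.0, 9.0],
]
query = [0.0, 0.0, 0.0, 0.0]
scores = sc_scores(data, query, num_subspaces=2, collision_ratio=0.5)
print(scores)  # [2, 1, 1, 0]
```

Points with a high SC-score are close to the query in many subspaces and are therefore natural candidates for exact re-ranking. This brute-force version scans every point in every subspace and is only meant to make the definitions concrete, not to reflect how an index would probe collisions.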