CRISP: Correlation-Resilient Indexing via Subspace Partitioning

Dimitris Dimitropoulos, Achilleas Michalopoulos, Dimitrios Tsitsigkos, Nikos Mamoulis

TL;DR

This work introduces CRISP, a framework for ANN search in very-high-dimensional spaces. It employs a lightweight, correlation-aware adaptive strategy that redistributes variance only when necessary, which reduces preprocessing complexity.

Abstract

As the dimensionality of modern learned representations increases to thousands of dimensions, the state-of-the-art Approximate Nearest Neighbor (ANN) indices exhibit severe limitations. Graph-based methods (e.g., HNSW) suffer from prohibitive memory consumption and routing degradation, while recent randomized quantization and learned rotation approaches (e.g., RaBitQ, OPQ) impose significant preprocessing overheads. We introduce CRISP, a novel framework designed for ANN search in very-high-dimensional spaces. Unlike rigid pipelines that apply expensive orthogonal rotations indiscriminately, CRISP employs a lightweight, correlation-aware adaptive strategy that redistributes variance only when necessary, effectively reducing the preprocessing complexity. We couple this adaptive mechanism with a cache-coherent Compressed Sparse Row (CSR) index structure. Furthermore, CRISP incorporates a multi-stage dual-mode query engine: a Guaranteed Mode that preserves rigorous theoretical lower bounds on recall, and an Optimized Mode that leverages rank-based weighted scoring and early termination to reduce query latency. Extensive evaluation on datasets of very high dimensionality (up to 4096) demonstrates that CRISP achieves state-of-the-art query throughput, low construction costs, and peak memory efficiency.
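
The CSR index structure mentioned in the abstract can be illustrated with a minimal sketch. It assumes each point has already been assigned to one cell (for example, a centroid id) and packs the per-cell member lists into two flat arrays, so that scanning a cell reads one contiguous slice. The names `build_csr`, `cell_members`, `assignments`, and `num_cells` are hypothetical and chosen for illustration; this is a generic CSR inverted list, not the authors' implementation.

```python
def build_csr(assignments, num_cells):
    """Pack per-cell point lists into CSR form.

    assignments[i] is the (hypothetical) cell id of point i.
    Returns (offsets, ids) where the members of cell c are
    ids[offsets[c]:offsets[c + 1]].
    """
    # Count how many points fall into each cell.
    offsets = [0] * (num_cells + 1)
    for c in assignments:
        offsets[c + 1] += 1
    # Prefix sum turns the counts into start positions.
    for c in range(num_cells):
        offsets[c + 1] += offsets[c]
    # Scatter point ids into their cell's contiguous range.
    ids = [0] * len(assignments)
    cursor = list(offsets[:-1])  # next free slot per cell
    for point_id, c in enumerate(assignments):
        ids[cursor[c]] = point_id
        cursor[c] += 1
    return offsets, ids


def cell_members(offsets, ids, c):
    """Return the point ids stored in cell c as one contiguous slice."""
    return ids[offsets[c]:offsets[c + 1]]


offsets, ids = build_csr([2, 0, 2, 1, 0], num_cells=3)
print(offsets, ids, cell_members(offsets, ids, 2))
```

Because every cell occupies one contiguous range of `ids`, a lookup touches sequential memory rather than following pointers through hash buckets, which is the property the CSR layout is used for here.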

Paper Structure

This paper contains 21 sections, 1 theorem, 6 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

(Conditional Recall Lower Bound). Let $x^*$ be the true nearest neighbor of query $q$ with single-subspace collision probability $p^*$. Let $\tau$ be the selection threshold defined by $\tau = \alpha \cdot M$. If CRISP is configured in Guaranteed Mode, the probability that $x^*$ is successfully retrieved [...] subject to the condition that the expected collision count $\mu = Mp^*$ strictly exceeds $\tau$.
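
The condition in the theorem can be checked numerically. Since $\mu = Mp^*$ and $\tau = \alpha \cdot M$, the requirement $\mu > \tau$ is equivalent to $p^* > \alpha$. The sketch below tests that condition and, as an illustration only, computes the probability that the collision count reaches $\tau$ under two assumptions that are not taken from the source: collisions in the $M$ subspaces are independent, and retrieval requires at least $\lceil \tau \rceil$ collisions. It does not reproduce the theorem's own bound. The function names are hypothetical.

```python
from math import ceil, comb


def condition_holds(M, p_star, alpha):
    """True when mu = M * p_star strictly exceeds tau = alpha * M."""
    return M * p_star > alpha * M


def binomial_tail(M, p_star, alpha):
    """P[X >= ceil(alpha * M)] for X ~ Binomial(M, p_star).

    Assumption (not from the source): subspace collisions are
    independent, so the collision count is binomial.
    """
    t = ceil(alpha * M)
    return sum(
        comb(M, k) * p_star**k * (1 - p_star) ** (M - k)
        for k in range(t, M + 1)
    )


# Example: M = 8 subspaces, threshold fraction alpha = 0.5.
for p in (0.5, 0.75):
    print(p, condition_holds(8, p, 0.5), binomial_tail(8, p, 0.5))
```

With $p^* = 0.5$ the condition fails because $\mu$ equals $\tau$ rather than strictly exceeding it; with $p^* = 0.75$ it holds and the tail probability under these assumptions is close to 1.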

Figures (8)

  • Figure 1: Overview of the CRISP Architecture.
  • Figure 2: CRISP replaces fragmented hash-based traversal (a) with a Compressed Sparse Row (CSR) layout (b), shifting the bottleneck from memory latency to peak memory bandwidth.
  • Figure 3: Candidate scoring on a $5 \times 5$ centroid grid. Axes show sorted partial distances $(0, 1, 2, \dots)$ in each subspace half. Cells are visited in ascending order of cost (Ranks 1-13) until sufficient candidates are retrieved. (a) Guaranteed Mode: Uniform weighting ($w=1$) for all candidates (gray). (b) Optimized Mode: The first $k_{size}$ cells (dark gray, Ranks 1-6) with lowest costs receive double weight ($w=2$) to prioritize likely nearest neighbors. The remaining cells (light gray) use standard weight ($w=1$).
  • Figure 4: Minimum index construction time required to reach specific Recall@100 thresholds (80%, 85%, 90%, 95%, 99%) across nine benchmark datasets (log-scale y-axis). Missing bars indicate that a method could not achieve the required recall threshold with the tested configurations. Lower is better.
  • Figure 5: Recall@100 vs. QPS Pareto frontiers across nine benchmark datasets (log-scale y-axis). Each subplot shows the optimal throughput-accuracy trade-off for each method. Higher and further right is better.
  • ...and 3 more figures
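
The candidate scoring described in the Figure 3 caption can be sketched for a single subspace as follows. The sketch assumes that a cell's cost is the sum of the two partial distances (one per subspace half) and that traversal stops once a candidate budget is met; both are readings of the caption, not confirmed details. Setting `k_size=0` gives uniform weights as in Guaranteed Mode, while a positive `k_size` doubles the weight of the lowest-cost cells as in Optimized Mode. The names `score_candidates`, `cell_points`, and `budget` are hypothetical.

```python
from collections import Counter
from itertools import product


def score_candidates(dists_a, dists_b, cell_points, budget, k_size=0):
    """Score candidates from one subspace's centroid grid.

    dists_a, dists_b: partial distances from the query to the
        centroids of each subspace half.
    cell_points: dict mapping (i, j) grid cells to lists of point ids.
    budget: stop visiting cells once this many candidates are retrieved.
    k_size: the first k_size cells by rank get weight 2, the rest weight 1.
    """
    # Visit cells in ascending order of assumed cost (sum of partial distances);
    # the cell index breaks ties deterministically.
    cells = sorted(
        product(range(len(dists_a)), range(len(dists_b))),
        key=lambda ij: (dists_a[ij[0]] + dists_b[ij[1]], ij),
    )
    scores = Counter()
    retrieved = 0
    for rank, cell in enumerate(cells, start=1):
        if retrieved >= budget:
            break  # enough candidates collected
        weight = 2 if rank <= k_size else 1
        for point_id in cell_points.get(cell, ()):
            scores[point_id] += weight
            retrieved += 1
    return scores


grid = {(0, 0): [7], (0, 1): [3], (1, 0): [4], (2, 2): [9]}
print(dict(score_candidates([0, 1, 2], [0, 1, 2], grid, budget=3)))
print(dict(score_candidates([0, 1, 2], [0, 1, 2], grid, budget=3, k_size=1)))
```

Summing the returned counters over all subspaces would give each point an aggregate score; sorting the full grid is used here for clarity, whereas a priority queue over the two sorted axes would avoid enumerating every cell.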

Theorems & Definitions (1)

  • Theorem 1