Cardinality Estimation for High Dimensional Similarity Queries with Adaptive Bucket Probing

Zhonghan Chen, Qintian Guo, Ruiyuan Zhang, Xiaofang Zhou

Abstract

In this work, we address the problem of cardinality estimation for similarity search in high-dimensional spaces. Our goal is to design a framework that is lightweight, easy to construct, and capable of providing accurate estimates with satisfactory online efficiency. We leverage locality-sensitive hashing (LSH) to partition the vector space while preserving distance proximity. Building on this, we adopt the principles of classical multi-probe LSH to adaptively explore neighboring buckets, accounting for distance thresholds of varying magnitudes. To improve online efficiency, we employ progressive sampling to reduce the number of distance computations and use asymmetric distance computation in product quantization to accelerate distance calculations in high-dimensional spaces. In addition to handling static datasets, our framework includes an updating algorithm designed to efficiently support large-scale dynamic data updates. Experiments demonstrate that our methods accurately estimate the cardinality of similarity queries with satisfactory efficiency.
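To make the asymmetric distance computation mentioned in the abstract concrete, the following is a minimal sketch of the general product quantization technique, not the paper's implementation. The function names, the use of randomly chosen data points as centroids (a stand-in for k-means training), and the requirement that the dimension be divisible by the number of subspaces are all assumptions made for illustration.

```python
import numpy as np

def train_codebooks(data, n_subspaces, n_centroids, rng):
    """Build one codebook per subspace.

    Assumption: centroids are random data points, a cheap stand-in for k-means.
    The dimension must be divisible by n_subspaces.
    """
    n, d = data.shape
    sub_dim = d // n_subspaces
    codebooks = np.empty((n_subspaces, n_centroids, sub_dim))
    for m in range(n_subspaces):
        block = data[:, m * sub_dim:(m + 1) * sub_dim]
        idx = rng.choice(n, size=n_centroids, replace=False)
        codebooks[m] = block[idx]
    return codebooks

def encode(data, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    n_subspaces, _, sub_dim = codebooks.shape
    codes = np.empty((data.shape[0], n_subspaces), dtype=np.int64)
    for m in range(n_subspaces):
        block = data[:, m * sub_dim:(m + 1) * sub_dim]
        # Squared distance from every sub-vector to every centroid.
        d2 = ((block[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d2.argmin(axis=1)
    return codes

def adc_distances(query, codes, codebooks):
    """Approximate squared Euclidean distances from an unquantized query.

    The query stays exact while database points are represented by their
    codes, hence "asymmetric". One lookup table per subspace is built once,
    and each point then costs n_subspaces table lookups.
    """
    n_subspaces, _, sub_dim = codebooks.shape
    q_blocks = query.reshape(n_subspaces, 1, sub_dim)
    table = ((codebooks - q_blocks) ** 2).sum(axis=2)  # (n_subspaces, n_centroids)
    return table[np.arange(n_subspaces), codes].sum(axis=1)
```

In this sketch, the returned value for each point equals the exact squared distance between the query and the point's quantized reconstruction, so the approximation error comes only from quantizing the database side.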

Paper Structure

This paper contains 26 sections, 19 equations, 7 figures, 5 tables, 9 algorithms.

Figures (7)

  • Figure 1: Motivation of our work: the $x$-axis is the distance between the central bucket $\mathcal{B}_{central}$ and its $k$-step neighbor $\mathcal{N}_k$, where $k\in[0,14]$ is the Hamming distance, and the $y$-axis is the selectivity in $\mathcal{N}_k$. We use $5$ datasets, each with $10$ queries, for demonstration. As $\mathcal{N}_k$ becomes more distant from $\mathcal{B}_{central}$, the selectivity of the neighbor decreases, which means that closer neighbors are more likely to contain points that satisfy the similarity query in the context of cardinality estimation.
  • Figure 2: Efficiency: Time of Offline Estimator Construction (including all construction phases of each method)
  • Figure 3: Dynamic Prober: Breakdown of offline construction time across all $3$ phases
  • Figure 4: Dynamic Prober vs. Dynamic Prober-PQ: Speedup
  • Figure 5: Parameter Study - $\epsilon$: Accuracy and Efficiency
  • ...and 2 more figures
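
The measurement described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions rather than the paper's procedure: buckets are identified by binary codes from random hyperplane hashing (the LSH family is not specified in this overview), $\mathcal{N}_k$ is taken to be all points whose bucket code is at Hamming distance $k$ from the query's bucket, and selectivity is taken to be the fraction of points in $\mathcal{N}_k$ within Euclidean distance `tau` of the query. The names `hash_codes`, `selectivity_per_ring`, and `tau` are hypothetical.

```python
import numpy as np

def hash_codes(points, hyperplanes):
    """Binary bucket code: one bit per hyperplane, set when the projection is >= 0."""
    return (points @ hyperplanes.T >= 0).astype(np.uint8)

def selectivity_per_ring(data, query, tau, hyperplanes):
    """Map each Hamming distance k to the fraction of qualifying points in N_k.

    Only non-empty neighbors appear in the result.
    """
    codes = hash_codes(data, hyperplanes)
    central = hash_codes(query[None, :], hyperplanes)[0]
    # Hamming distance between each point's bucket and the central bucket.
    ring = (codes != central).sum(axis=1)
    # Points that satisfy the similarity query.
    hits = np.linalg.norm(data - query, axis=1) <= tau
    result = {}
    for k in range(hyperplanes.shape[0] + 1):
        in_ring = ring == k
        size = int(in_ring.sum())
        if size:
            result[k] = float(hits[in_ring].sum()) / size
    return result
```

Plotting the returned values against $k$ for several queries gives a curve of the kind the caption describes. Whether it decreases on a given dataset depends on the data and on the hash functions chosen.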

Theorems & Definitions (9)

  • Definition 1: Similarity Search
  • Definition 2: Cardinality Estimation for Similarity Search
  • Definition 3: Euclidean Distance
  • Definition 4
  • Definition 5: Central Bucket $\mathcal{B}_{central}$
  • Definition 6: Hamming Distances
  • Definition 7: $k$-Step Neighbor of $\mathcal{B}_{central}$
  • Example 4.1
  • Example 4.2