
$\boldsymbol{Steiner}$-Hardness: A Query Hardness Measure for Graph-Based ANN Indexes

Zeyu Wang, Qitong Wang, Xiaoxing Cheng, Peng Wang, Themis Palpanas, Wei Wang

TL;DR

This work addresses the instability of query performance in graph-based ANN indexes by introducing Steiner-hardness, a graph-native hardness measure defined as the minimum query effort on a representative MRNG graph and computed via connections to Directed Steiner Tree formulations. The authors develop a comprehensive framework for Minimum Effort (ME), adapt it to greedy graph searches, incorporate decision costs, and map the problem to DST and related problems with efficient Steiner-tree solvers. They demonstrate that Steiner-hardness correlates more strongly with actual query effort than prior measures and use unbiased workload generation to reveal robust index behavior, challenging some expectations from old benchmarks. The practical impact lies in providing a principled, graph-aware metric and benchmarks to guide index selection and robustness improvements in real-world high-dimensional similarity search systems.

Abstract

Graph-based indexes are widely used to accelerate approximate similarity search over high-dimensional vectors. However, their performance varies greatly from query to query, leading to an unstable quality of service for downstream applications. This calls for an effective measure of query hardness on graph indexes. Popular distance-based hardness measures such as LID (local intrinsic dimensionality) lose their effectiveness here because they ignore the graph structure. In this paper, we propose $Steiner$-hardness, a novel connection-based, graph-native query hardness measure. Specifically, we first propose a theoretical framework to analyze the minimum query effort on graph indexes and then define $Steiner$-hardness as the minimum effort on a representative graph. Moreover, we prove that $Steiner$-hardness is closely related to the classical Directed $Steiner$ Tree (DST) problem. Accordingly, we design a novel algorithm that reduces our problem to DST problems and then leverages their solvers to calculate $Steiner$-hardness efficiently. Compared with LID and other similar measures, $Steiner$-hardness shows a significantly better correlation with the actual query effort on various datasets. Additionally, an unbiased evaluation designed based on $Steiner$-hardness reveals new ranking results, indicating a meaningful direction for enhancing the robustness of graph indexes. This paper has been accepted by PVLDB 2025.
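As a rough illustration of what "actual query effort" means on a graph index, the sketch below runs a generic best-first (beam) search over a toy directed graph and counts the number of distance computations (NDC), the effort metric used in the figures. This is not the authors' implementation: the function names `l2` and `beam_search`, the beam-width parameter `ef`, and the toy chain graph are all assumptions made for the example.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def beam_search(graph, points, query, entry, k, ef):
    """Generic best-first search on a directed graph (hypothetical helper).

    Returns (ids of the k closest vertices found, NDC), where NDC is the
    number of distance computations spent on this query.
    """
    d0 = l2(points[entry], query)
    ndc = 1
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded vertex first
    results = [(-d0, entry)]     # max-heap (negated): best ef vertices so far
    while candidates:
        d, u = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result.
        if len(results) >= ef and d > -results[0][0]:
            break
        for v in graph[u]:
            if v in visited:
                continue
            visited.add(v)
            dv = l2(points[v], query)
            ndc += 1
            if len(results) < ef or dv < -results[0][0]:
                heapq.heappush(candidates, (dv, v))
                heapq.heappush(results, (-dv, v))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    top = sorted((-nd, v) for nd, v in results)[:k]
    return [v for _, v in top], ndc


# Toy data (assumed for illustration): five points on a line, chained together.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

far_ids, far_ndc = beam_search(graph, points, (3.9,), entry=0, k=1, ef=1)
near_ids, near_ndc = beam_search(graph, points, (0.1,), entry=0, k=1, ef=1)
```

On this toy graph the query far from the entry point needs more distance computations than the nearby one, even though both have the same distance to their nearest neighbor. That gap is driven by graph connectivity rather than by the local distance distribution, which is the kind of effect a purely distance-based measure cannot see.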

Paper Structure

This paper contains 26 sections, 5 theorems, 5 equations, 17 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

Given a graph $G(V,E)$, a query $q$ with its $k$NN $N_k$,

Figures (17)

  • Figure 1: Query performance variance on graph indexes. (a) Histograms of NDC to reach 90% recall on the Deep dataset. (b) A real example on a RAG task, where the low recall of hard queries impairs model accuracy.
  • Figure 2: Comparison of LID and ME on the same dataset.
  • Figure 3: The correlation between $Steiner$-hardness (b) and NDC to reach 90% recall is much stronger than LID (a).
  • Figure 4: An illustrative example of our ME definitions with $k$=5. The orange points and edges form $Y$. (a) $Acc$=100%, (b) $Acc$=80%, (c) entry point is limited to be in $N_k$, (d) $p$=0.4, (e) Limited range of candidates, (f) $ME-exhaustive$.
  • Figure 5: Query time breakdown on 1,000 queries ($k$=50).
  • ...and 12 more figures

Theorems & Definitions (14)

  • Definition 1: $k$NN Query
  • Definition 2: $ME@Acc$
  • Definition 3: $ME_{\delta}^p@Acc$
  • Definition 4: critical point $\delta_0$
  • Definition 5: Decision cost
  • Definition 6: $ME_{\delta}^p@Acc-exhaustive$
  • Definition 7: Directed Steiner Tree (DST)
  • Theorem 1
  • Definition 8: vertex-focused Directed Steiner Network (vDSN)
  • Theorem 2
  • ...and 4 more