Towards Robustness: A Critique of Current Vector Database Assessments

Zikai Wang, Qianxi Zhang, Baotong Lu, Qi Chen, Cheng Tan

Abstract

Vector databases are critical infrastructure in AI systems, and average recall is the dominant metric for their evaluation. Both users and researchers rely on it to choose and optimize their systems. We show that relying on average recall is problematic. It hides variability across queries, allowing systems with strong mean performance to underperform significantly on hard queries. These tail cases confuse users and can lead to failure in downstream applications such as RAG. We argue that robustness, that is, consistently achieving acceptable recall across queries, is crucial to vector database evaluation. We propose Robustness-$δ$@K, a new metric that captures the fraction of queries with recall above a threshold $δ$. This metric offers a deeper view of the recall distribution, helps match vector index selection to application needs, and guides the optimization of tail performance. We integrate Robustness-$δ$@K into existing benchmarks and evaluate mainstream vector indexes, revealing significant robustness differences. More robust vector indexes yield better application performance, even with the same average recall. We also identify design factors that influence robustness, providing guidance for improving real-world performance.
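To make the metric concrete, the following is a minimal Python sketch of how Robustness-$δ$@K could be computed from per-query recall values. It is an illustration, not the authors' implementation: the function names `recall_at_k` and `robustness_at_k` are hypothetical, and the sketch assumes the threshold comparison is inclusive (recall ≥ δ), which the abstract does not specify. The sample recall values are invented solely to show how two indexes with the same average recall can differ in robustness.

```python
from typing import Sequence


def recall_at_k(retrieved: Sequence[int], ground_truth: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbor ids found among the first k retrieved ids."""
    truth = set(ground_truth[:k])
    if not truth:
        raise ValueError("ground_truth must contain at least one id")
    hits = len(truth.intersection(retrieved[:k]))
    return hits / len(truth)


def robustness_at_k(recalls: Sequence[float], delta: float) -> float:
    """Fraction of queries whose Recall@K reaches the threshold delta.

    Assumption: the comparison is inclusive (recall >= delta).
    """
    if not recalls:
        raise ValueError("recalls must not be empty")
    return sum(r >= delta for r in recalls) / len(recalls)


# Made-up per-query Recall@10 values for two hypothetical indexes.
# Both average 0.9, but index B returns nothing useful for one query.
recalls_a = [0.9] * 10
recalls_b = [1.0] * 9 + [0.0]

for name, recalls in (("A", recalls_a), ("B", recalls_b)):
    mean = sum(recalls) / len(recalls)
    print(name, "mean recall:", round(mean, 3),
          "robustness-0.9@10:", robustness_at_k(recalls, 0.9))
```

In this toy setup, average recall cannot tell the two indexes apart, while the thresholded fraction exposes the failed query in index B. Sweeping `delta` over several values gives a coarse picture of the whole recall distribution.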

Paper Structure

This paper contains 54 sections, 8 equations, 14 figures.

Figures (14)

  • Figure 1: Recall distribution of ScaNN and DiskANN on MSMARCO, each achieving an average Recall@10 of 0.9. Queries returning zero ground-truth items are highlighted with a red frame. Recall@10=0.9 and Recall@10=1.0 query results are shown above their bars.
  • Figure 2: Overview of a graph-based index (left) and a partition-based index (right). In both cases, the query is represented as a red star, and dataset points are shown as blue, orange, and green dots, with dots bordered in red indicating the top 5 nearest neighbors to the query. In the graph-based index (a), dashed lines represent edges between vectors in the graph. A hollow dot indicates the entry point of the graph search, while red arrows trace the search path. In the partition-based index (b), each color corresponds to a distinct partition, and dashed lines denote partition boundaries. Hollow dots represent partition centroids, with those bordered in red being the top 2 nearest centroids to the query.
  • Figure 3: Correlation ($r^2$) between each metric and average Recall@10, computed across all index configurations on four datasets. Green (low $r^2$): the metric captures information beyond recall. Red (high $r^2$): the metric is redundant with recall.
  • Figure 4: Dataset characteristics.
  • Figure 5: Index parameters for all experiments. For graph-based indexes, M and R represent the maximum degree of a node; efConstruction and L represent the search list length during building; efSearch and Ls represent the search list length during searching. For partition-based indexes, n_list and #leaves represent the number of clusters; n_probe and #l_search represent the number of clusters searched. In ScaNN, ro_#n is short for reorder_num_neighbors and represents the number of candidate neighbors to be reranked; #leaves is short for num_leaves, and #l_search is short for num_leaves_to_search. In Puck, s_#c represents the number of coarse clusters, and s_range is short for tinker_search_range, which represents the number of finer clusters searched. Zilliz is excluded from MSMARCO because the Docker image has a bug when quantizing 768d vectors (unfixable without a vendor patch).
  • ...and 9 more figures