
VIBE: Vector Index Benchmark for Embeddings

Elias Jääsaari, Ville Hyvönen, Matteo Ceccarello, Teemu Roos, Martin Aumüller

TL;DR

VIBE addresses the need for up-to-date, open benchmarks for vector indexes performing ANN on modern embeddings, including out-of-distribution workloads. It introduces a pipeline that generates benchmark datasets from contemporary embedding models and supports OOD scenarios, quantization, and broad hardware with an accompanying interactive website for analysis. The study benchmarks 21 implementations across 12 in-distribution and 6 out-of-distribution datasets, revealing that graph- and clustering-based indexes deliver the best throughput at high recall, with quantization and GPUs offering substantial throughput gains, while OOD performance remains dataset-dependent. The work provides a practical, extensible framework for rigorous, future-proof evaluation of vector indexes in modern AI pipelines, with clear implications for deploying high-performance ANN systems in retrieval-augmented generation and multimodal search contexts.

Abstract

Approximate nearest neighbor (ANN) search is a performance-critical component of many machine learning pipelines. Rigorous benchmarking is essential for evaluating the performance of vector indexes for ANN search. However, the datasets of the existing benchmarks are no longer representative of the current applications of ANN search. Hence, there is an urgent need for an up-to-date set of benchmarks. To this end, we introduce Vector Index Benchmark for Embeddings (VIBE), an open source project for benchmarking ANN algorithms. VIBE contains a pipeline for creating benchmark datasets using dense embedding models characteristic of modern applications, such as retrieval-augmented generation (RAG). To replicate real-world workloads, we also include out-of-distribution (OOD) datasets where the queries and the corpus are drawn from different distributions. We use VIBE to conduct a comprehensive evaluation of SOTA vector indexes, benchmarking 21 implementations on 12 in-distribution and 6 out-of-distribution datasets.

Paper Structure

This paper contains 54 sections, 1 equation, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Algorithm performance of the fastest configuration with average recall $\ge 95\%$ on datasets with in-distribution and out-of-distribution queries, in terms of queries per second relative to the best algorithm on each dataset. Each circle corresponds to an algorithm, arranged by decreasing average rank on the datasets. Datasets are arranged in clockwise order by decreasing difficulty (as measured by the relative contrast [DBLP:conf/icml/HeKC12]), with landmark being the easiest. Axes with a missing dot correspond to datasets where the algorithm could not reach 95% recall. A missing axis for the llama and yi datasets indicates that the corresponding algorithm does not support inner product similarity.
  • Figure 2: Recall/throughput tradeoff on two text embedding datasets (in-distribution queries). Graph-based SymphonyQG, Glass, and NGT-QG have the highest throughput at average recall above 90%.
  • Figure 3: Recall/throughput tradeoff on two image embedding datasets (in-distribution queries). Graph-based SymphonyQG and Glass, and clustering-based LoRANN have the highest throughput.
  • Figure 4: Recall of the 100 hardest and 100 easiest queries (by RC score) on two datasets, considering the fastest configuration achieving an average recall of at least 90% (marked by the vertical line). SymphonyQG is robust w.r.t. the query difficulty, while Glass and LoRANN show larger variability.
  • Figure 5: Left: recall/throughput tradeoff on binary data. Right: recall/throughput tradeoff of GPU algorithms. The graph-based NGT-ONNG has the highest throughput on the binary data, and the graph-based CAGRA is the fastest method in the GPU setting.
  • ...and 14 more figures
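
The summary metric described in the Figure 1 caption (the fastest configuration reaching an average recall of at least 95%, reported as queries per second relative to the best algorithm on each dataset) can be illustrated with a minimal Python sketch. This is not VIBE's own code: the function names, the `(dataset, algorithm, recall, qps)` tuple layout, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def fastest_at_recall(runs, min_recall=0.95):
    """Return {dataset: {algorithm: best_qps}} over qualifying configurations.

    `runs` is an iterable of (dataset, algorithm, recall, qps) tuples, one per
    benchmarked configuration (a hypothetical layout, not VIBE's result format).
    Algorithms that never reach `min_recall` on a dataset are simply absent,
    mirroring the "missing dot" case in the caption.
    """
    best = defaultdict(dict)
    for dataset, algorithm, recall, qps in runs:
        if recall >= min_recall and qps > best[dataset].get(algorithm, 0.0):
            best[dataset][algorithm] = qps
    return best


def relative_throughput(best):
    """Scale each algorithm's throughput by the best one on the same dataset."""
    relative = {}
    for dataset, per_algorithm in best.items():
        top = max(per_algorithm.values())
        relative[dataset] = {a: q / top for a, q in per_algorithm.items()}
    return relative


# Made-up measurements purely for illustration.
runs = [
    ("dataset-a", "algo-x", 0.96, 1000.0),
    ("dataset-a", "algo-x", 0.99, 400.0),   # slower configuration, ignored
    ("dataset-a", "algo-y", 0.95, 500.0),
    ("dataset-a", "algo-z", 0.90, 9000.0),  # fast but below the recall bar
    ("dataset-b", "algo-x", 0.97, 200.0),
    ("dataset-b", "algo-y", 0.98, 800.0),
]

rel = relative_throughput(fastest_at_recall(runs))
for dataset, scores in sorted(rel.items()):
    print(dataset, scores)
```

In this toy input, `algo-z` has the highest raw throughput on `dataset-a` but is excluded because none of its configurations reaches the recall threshold, which is the situation the caption describes as an axis with a missing dot.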