Cracking Vector Search Indexes

Vasilis Mageirakos, Bowen Wu, Gustavo Alonso

TL;DR

CrackIVF tackles the challenge of index selection for embedding data lakes in Retrieval Augmented Generation by introducing an adaptive, incremental IVF index that evolves with the query workload. It couples two local build operations, CRACK and REFINE, with a budget-based control mechanism and a simple cost model to balance indexing and search, so queries can be processed immediately while performance improves gradually. Empirical results across standard ANN benchmarks show CrackIVF achieving near-optimal query performance with 10-1000x faster initialization, often answering more than 1 million queries before conventional pre-built indexes finish building. The approach also reduces the total distance computations spent on index builds, which makes embedding-lake deployments practical for cold or unseen data.

Abstract

Retrieval Augmented Generation (RAG) uses vector databases to expand the expertise of an LLM without having to retrain it. The idea can be applied over data lakes, leading to the notion of embedding data lakes, i.e., a pool of vector databases ready to be used by RAG systems. The key components in these systems are the indexes enabling Approximate Nearest Neighbor Search (ANNS). However, in data lakes, one cannot realistically expect to build indexes for every dataset. Thus, we propose an adaptive, partition-based index, CrackIVF, that performs much better than up-front index building. CrackIVF starts answering queries with a small index and expands it to improve performance only once it has seen enough queries. It does so by progressively adapting the index to the query workload. That way, queries can be answered right away without having to build a full index first. After seeing enough queries, CrackIVF will produce an index comparable to those built with conventional techniques. CrackIVF can often answer more than 1 million queries before other approaches have even built the index, achieving 10-1000x faster initialization times. This makes it ideal for cold or infrequently used data and as a way to bootstrap access to unseen datasets.
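
To make the idea of an index that starts small and grows with the workload more concrete, here is a minimal toy sketch in Python with NumPy. It is not the authors' algorithm: the class name `TinyAdaptiveIVF`, the `split_after` threshold (a crude stand-in for the budget-based control), and the choice to use the triggering query as the new centroid are all assumptions made for illustration. The paper's actual CRACK and REFINE operations and its cost model are not reproduced here.

```python
import numpy as np


class TinyAdaptiveIVF:
    """Toy partition-based index (hypothetical, for illustration only).

    It starts with a few randomly chosen centroids instead of a full k-means
    build, and splits a partition once it has been probed often enough.
    """

    def __init__(self, data, n_initial=2, split_after=50, seed=0):
        self.data = np.asarray(data, dtype=np.float32)
        rng = np.random.default_rng(seed)
        # Cheap initialization: sample a few data points as centroids.
        picks = rng.choice(len(self.data), size=n_initial, replace=False)
        self.centroids = self.data[picks].copy()
        # Assign every vector to its nearest initial centroid.
        d = ((self.data[:, None, :] - self.centroids[None, :, :]) ** 2).sum(-1)
        self.assign = d.argmin(axis=1)
        self.hits = np.zeros(n_initial, dtype=np.int64)
        self.split_after = split_after  # assumed stand-in for a build budget

    def search(self, query, k=1, nprobe=1):
        query = np.asarray(query, dtype=np.float32)
        # Rank partitions by centroid distance and scan the closest ones.
        cd = ((self.centroids - query) ** 2).sum(-1)
        probed = np.argsort(cd)[:nprobe]
        cand = np.flatnonzero(np.isin(self.assign, probed))
        dist = ((self.data[cand] - query) ** 2).sum(-1)
        top = cand[np.argsort(dist)[:k]]
        # Count the access; split the closest partition once it is "hot".
        hot = probed[0]
        self.hits[hot] += 1
        if self.hits[hot] >= self.split_after:
            self._split(hot, query)
        return top

    def _split(self, pid, query):
        members = np.flatnonzero(self.assign == pid)
        self.hits[pid] = 0
        if len(members) < 2:
            return
        # Local operation: add one centroid and reassign only the vectors
        # of the partition being split, leaving the rest of the index alone.
        new_id = len(self.centroids)
        self.centroids = np.vstack([self.centroids, query[None, :]])
        self.hits = np.append(self.hits, 0)
        d_old = ((self.data[members] - self.centroids[pid]) ** 2).sum(-1)
        d_new = ((self.data[members] - query) ** 2).sum(-1)
        self.assign[members[d_new < d_old]] = new_id
```

In this sketch the first query is answered as soon as the initial assignment is done, and partitions that are probed often become progressively smaller, so later queries to those regions scan fewer vectors. Because reassignment is local to the split partition, the result is only an approximation of a true nearest-centroid partitioning.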

Paper Structure

This paper contains 18 sections, 8 equations, 8 figures, 4 tables, 3 algorithms.

Figures (8)

  • Figure 1: Total time to answer the queries submitted for different indexing strategies vs. the number of queries submitted.
  • Figure 2: Observations behind the design: (a) separating the number and refinement of partitions; (b) access to the index is generally skewed towards certain regions; (c) regions queried often can be clustered and refined at a much lower granularity.
  • Figure 3: Queries Per Second (QPS) vs. Recall and Time per each Query batch for the entire trace across different datasets.
  • Figure 4: Cumulative time plots across datasets.
  • Figure 5: Cumulative distance computations for index build.
  • ...and 3 more figures