Foundations of Vector Retrieval

Sebastian Bruch

TL;DR

Foundations of Vector Retrieval surveys the core theory and algorithmic toolkit for retrieving the top-$k$ vectors most similar to a query, across data modalities. It presents four main algorithmic families (branch-and-bound trees, locality-sensitive hashing, graph-based methods, and clustering) and complements them with a detailed treatment of vector compression via quantization, namely product quantization (PQ) and additive quantization (AQ), to enable scalable indexing. A central thread is the tension between exactness and scalability in high dimensions, explored through intrinsic dimensionality, doubling dimension, and instability results; the monograph also gives guarantees under doubling-measure assumptions and constructive bounds for several methods. It emphasizes the relevance of vector retrieval to real-time search, recommender systems, and retrieval-augmented generation, and frames future directions around closer alignment between theory and experiment, scalable graph structures, and advanced compression techniques for billion-scale vector databases.

Abstract

Vectors are universal mathematical objects that can represent text, images, speech, or a mix of these data modalities. This holds regardless of whether data is represented by hand-crafted features or learnt embeddings. Collect a large enough quantity of such vectors and the question of retrieval becomes urgently relevant: finding the vectors that are most similar to a query vector. This monograph is concerned with that question and covers fundamental concepts along with advanced data structures and algorithms for vector retrieval. In doing so, it recaps this fascinating topic and lowers barriers of entry into this rich area of research.

Paper Structure (149 sections, 68 theorems, 262 equations, 25 figures, 2 tables, 8 algorithms)

Key Result

Theorem 1

Suppose data points $\mathcal{X}$ are independent and identically distributed (iid) in each dimension and drawn from a zero-mean distribution. Then, for any $u \in \mathcal{X}$:

Figures (25)

  • Figure 1: Vector representation of a piece of text by adopting a "bag of words" view: A text document, when stripped of grammar and word order, can be thought of as a vector, where each coordinate represents a term in our vocabulary and its value records the frequency of that term in the document or some function of it. The resulting vectors are typically sparse; that is, they have very few non-zero coordinates.
  • Figure 2: Variants of vector retrieval for a toy vector collection in $\mathbb{R}^2$. In Nearest Neighbor search, we find the data point whose $L_2$ distance to the query point is minimal ($v$ for top-$1$ search). In Maximum Cosine Similarity search, we instead find the point whose angular distance to the query point is minimal ($v$ and $p$ are equidistant from the query). In Maximum Inner Product Search, we find a vector that maximizes the inner product with the query vector. This can be understood as letting the hyperplane orthogonal to the query point sweep the space towards the origin; the first vector to touch the sweeping plane is the maximizer of inner product. Another interpretation is this: the shaded region in the figure contains all the points $y$ for which $p$ is the answer to $\mathop{\mathrm{arg\,max}}\limits_{x \in \{ u, v, w, p\}} \langle x, y \rangle$.
  • Figure 3: Probability that $u \in \mathcal{X}$ is the solution to MIPS over $\mathcal{X}$ with query $u$ versus the dimensionality $d$, for various synthetic and real collections $\mathcal{X}$. For synthetic collections, $\lvert \mathcal{X} \rvert = 100{,}000$. Appendix \ref{appendix:collections} gives a description of the real collections. Note that, for real collections, we estimate the reported probability by sampling $10{,}000$ data points and using them as queries. Furthermore, we do not pre-process the vectors; importantly, we do not $L_2$-normalize the collections.
  • Figure 4: Approximate variants of top-$1$ retrieval for a toy collection in $\mathbb{R}^2$. In NN, we admit vectors whose distance to the query exceeds that of the optimal solution by a factor of at most $1 + \epsilon$. As such, $x$ and $y$ are both valid solutions as they are in a ball with radius $(1+\epsilon) \delta(q, x)$ centered at the query. Similarly, in MCS, we accept a vector (e.g., $x$) if its angle with the query point is at most a factor of $1 + \epsilon$ greater than the angle between the query and the optimal vector (i.e., $v$). For the MIPS example, assuming that the inner product of query and $x$ is at least $(1 - \epsilon)$ times the inner product of query and $p$, then $x$ is an acceptable solution.
  • Figure 5: Simulation results for Theorem \ref{theorem:instability:beyer} applied to NN with $L_2$ distance. Left: The ratio between the maximum distance between a query and data points $\delta_\ast$, to the minimum distance $\delta^\ast$. The shaded region shows one standard deviation. As dimensionality increases, this ratio tends to $1$. Right: The percentage of data points whose distance to a query is at most $(1 + \epsilon/100) \delta^\ast$, visualized for the Gaussian distribution; the trend is similar for other distributions. As $d$ increases, more vectors fall into the enlarged ball, making them valid solutions to the approximate NN problem.
  • ...and 20 more figures
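
The three retrieval variants described in the Figure 2 caption (Nearest Neighbor search, Maximum Cosine Similarity search, and Maximum Inner Product Search) can be illustrated with a brute-force sketch. The toy coordinates, the point names, and the helper function names below are assumptions chosen for illustration; they are not the values plotted in the figure. The sketch only shows that the three objectives can pick different winners on the same collection.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def l2_distance(x, y):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def nearest_neighbor(query, points):
    """NN: name of the point with minimal L2 distance to the query."""
    return min(points, key=lambda name: l2_distance(query, points[name]))

def max_cosine_similarity(query, points):
    """MCS: name of the point with minimal angle to the query."""
    return max(points, key=lambda name: cosine_similarity(query, points[name]))

def max_inner_product(query, points):
    """MIPS: name of the point with maximal inner product with the query."""
    return max(points, key=lambda name: dot(query, points[name]))

# Hypothetical toy collection in R^2 (not the coordinates used in Figure 2).
points = {
    "u": (-1.0, 0.5),
    "v": (1.0, 1.4),
    "w": (2.0, 2.0),
    "p": (4.0, 1.0),
}
query = (1.0, 1.0)

print("NN:  ", nearest_neighbor(query, points))
print("MCS: ", max_cosine_similarity(query, points))
print("MIPS:", max_inner_product(query, points))
```

On this assumed collection, each objective selects a different point: the closest point is not the one best aligned in direction, and neither has the largest inner product, since the inner product also rewards large norms.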

Theorems & Definitions (135)

  • definition: Top-$k$ Retrieval
  • theorem 1
  • proof
  • definition: $\epsilon$-Approximate Top-$k$ Retrieval
  • theorem 2
  • proof
  • theorem 3
  • proof
  • definition
  • definition
  • ...and 125 more
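
The first two definitions in the list above, Top-$k$ Retrieval and $\epsilon$-Approximate Top-$k$ Retrieval, can be sketched for the $L_2$ nearest-neighbor case. The definitions themselves are not reproduced on this page, so the acceptance rule below is an assumption: it extends the top-$1$ rule from the Figure 4 caption (a ball of radius $(1+\epsilon)$ times the optimal distance) by measuring against the exact $k$-th nearest neighbor. The function names and the toy data are likewise hypothetical.

```python
import heapq
import math

def l2_distance(x, y):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def top_k(query, collection, k):
    """Exact top-k retrieval by brute force: the k points closest to the query."""
    return heapq.nsmallest(k, collection, key=lambda x: l2_distance(query, x))

def is_epsilon_approximate(query, collection, candidates, k, epsilon):
    """Assumed acceptance rule: there are k candidates, and each lies within
    (1 + epsilon) times the distance of the exact k-th nearest neighbor."""
    exact = top_k(query, collection, k)
    threshold = (1.0 + epsilon) * l2_distance(query, exact[-1])
    return len(candidates) == k and all(
        l2_distance(query, c) <= threshold for c in candidates
    )

# Hypothetical toy collection; distances to the query are 1, 2, 2.1, and about 7.07.
collection = [(0.0, 1.0), (0.0, 2.0), (2.1, 0.0), (5.0, 5.0)]
query = (0.0, 0.0)

exact = top_k(query, collection, k=2)
candidates = [(0.0, 1.0), (2.1, 0.0)]  # swaps the true 2nd neighbor for the 3rd

print(exact)
print(is_epsilon_approximate(query, collection, candidates, k=2, epsilon=0.1))
print(is_epsilon_approximate(query, collection, candidates, k=2, epsilon=0.01))
```

With a tolerance of $\epsilon = 0.1$ the candidate set is accepted because the substituted point is only slightly farther than the true second neighbor; tightening the tolerance to $\epsilon = 0.01$ rejects it.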