MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings

Laxman Dhulipala, Majid Hadian, Rajesh Jayaram, Jason Lee, Vahab Mirrokni

TL;DR

MUVERA tackles the high cost of multi-vector retrieval by reducing Chamfer-based MV similarity to a single-vector MIPS problem through Fixed Dimensional Encodings (FDEs). The FDEs provide $ε$-approximation guarantees to Chamfer similarity and are data-oblivious, enabling efficient indexing with off-the-shelf MIPS solvers and a single reranking step. The approach yields robust end-to-end improvements: on BEIR benchmarks, it achieves about 10% higher recall with roughly 90% lower latency than the prior PLAID system, and supports substantial memory savings via Product Quantization (PQ) compression (32×). These results demonstrate that principled probabilistic partitioning and projection can bridge the gap between single- and multi-vector retrieval, offering a practical, scalable MV retrieval solution with theoretical backing.

Abstract

Neural embedding models have become a fundamental component of modern information retrieval (IR) pipelines. These models produce a single embedding $x \in \mathbb{R}^d$ per data point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT paper, multi-vector models, which produce a set of embeddings per data point, have achieved markedly superior performance for IR tasks. Unfortunately, using these models for IR is computationally expensive due to the increased complexity of multi-vector retrieval and scoring. In this paper, we introduce MUVERA (MUlti-VEctor Retrieval Algorithm), a retrieval mechanism which reduces multi-vector similarity search to single-vector similarity search. This enables the usage of off-the-shelf MIPS solvers for multi-vector retrieval. MUVERA asymmetrically generates Fixed Dimensional Encodings (FDEs) of queries and documents, which are vectors whose inner product approximates multi-vector similarity. We prove that FDEs give high-quality $ε$-approximations, thus providing the first single-vector proxy for multi-vector similarity with theoretical guarantees. Empirically, we find that FDEs achieve the same recall as prior state-of-the-art heuristics while retrieving 2-5$\times$ fewer candidates. Compared to prior state-of-the-art implementations, MUVERA achieves consistently good end-to-end recall and latency across a diverse set of BEIR retrieval datasets, achieving an average of $10\%$ improved recall with $90\%$ lower latency.
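
The multi-vector similarity that the FDE inner product is meant to approximate can be illustrated with a small example. The sketch below assumes the usual Chamfer (MaxSim-style) definition, in which each query vector is matched to its best document vector and the resulting inner products are summed. The helper names `dot` and `chamfer_similarity` are illustrative and not taken from the paper, and only the Python standard library is used.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def chamfer_similarity(Q, P):
    """Sum, over query vectors q, of the largest inner product between q and any p in P.

    Assumed definition: Chamfer(Q, P) = sum_{q in Q} max_{p in P} <q, p>.
    """
    return sum(max(dot(q, p) for p in P) for q in Q)


# Two query vectors and two document vectors in R^2 (all unit length).
Q = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, 0.0], [0.6, 0.8]]

# First query vector matches [1, 0] (score 1.0); second matches [0.6, 0.8] (score 0.8).
score = chamfer_similarity(Q, P)
print(score)  # approximately 1.8
```

Scoring one pair this way costs |Q| x |P| inner products, which is why applying it to every document in a large corpus is expensive and why a single-vector proxy is attractive.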

Paper Structure

This paper contains 30 sections, 4 theorems, 28 equations, 15 figures, 4 tables.

Key Result

Theorem 2.1

Fix any $\varepsilon, \delta > 0$, and sets $Q, P \subset \mathbb{R}^d$ of unit vectors, and let $m = |Q| + |P|$. Then setting $k_{\texttt{sim}} = O\left(\frac{\log (m\delta^{-1})}{\varepsilon}\right)$, $d_{\texttt{proj}} = O\left(\frac{1}{\varepsilon^2} \log \left(\frac{m}{\varepsilon\delta}\right)\right)$, $R_

Figures (15)

  • Figure 1: $\textsc{Muvera}$'s two-step retrieval process, compared to PLAID's multi-stage retrieval process. Diagram on the right from Santhanam et al., reproduced with permission.
  • Figure 2: FDE Generation Process. Three SimHashes ($k_{\texttt{sim}} = 3$) split space into six regions labelled $A$-$F$ (in high dimensions $B= 2^{k_{\texttt{sim}}}$, but $B=6$ here since $d=2$). $\mathbf{F}_{\text{q}}(Q),\mathbf{F}_{\text{doc}}(P)$ are shown as $B \times d$ matrices, where the $k$-th row is $\vec{q}_{(k)}, \vec{p}_{(k)}$. The actual FDEs are flattened versions of these matrices. Not shown: inner projections, repetitions, and fill_empty_clusters.
  • Figure 3: FDE recall vs dimension for varying FDE parameters on MS MARCO. Plots show FDE Recall$@$100, 1k, and 10k from left to right. Recall$@N$ for exact Chamfer scoring is shown by dotted lines.
  • Figure 4: Comparison of FDE recall versus brute-force search over Chamfer similarity.
  • Figure 5: FDE retrieval vs SV Heuristic, both with and without document id deduplication.
  • ...and 10 more figures
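
The construction summarized in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that a query's row for bucket $k$ is the sum of the query vectors falling in that bucket, that a document's row is the mean (centroid) of its vectors in that bucket, and that empty buckets are left as zeros. As in the figure, inner projections, repetitions, and fill_empty_clusters are omitted. The function names `simhash_bucket`, `fde`, and `unit` are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale a vector to unit length (the theorem is stated for unit vectors)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def simhash_bucket(x, gaussians):
    """Bucket index built from the sign bits of <g, x> for each Gaussian vector g."""
    bucket = 0
    for g in gaussians:
        bucket = (bucket << 1) | (1 if dot(g, x) > 0 else 0)
    return bucket


def fde(vectors, gaussians, is_query):
    """Flattened B x d encoding with B = 2 ** k_sim buckets.

    Assumption: queries store the per-bucket sum, documents the per-bucket mean.
    Empty buckets stay all-zero (no fill_empty_clusters in this sketch).
    """
    d = len(vectors[0])
    num_buckets = 2 ** len(gaussians)
    sums = [[0.0] * d for _ in range(num_buckets)]
    counts = [0] * num_buckets
    for x in vectors:
        k = simhash_bucket(x, gaussians)
        counts[k] += 1
        for j in range(d):
            sums[k][j] += x[j]
    flat = []
    for k in range(num_buckets):
        scale = 1.0 if (is_query or counts[k] == 0) else 1.0 / counts[k]
        flat.extend(s * scale for s in sums[k])
    return flat


rng = random.Random(0)
d, k_sim = 4, 3
# Queries and documents must share the same SimHash directions.
gaussians = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k_sim)]
Q = [unit([rng.gauss(0.0, 1.0) for _ in range(d)]) for _ in range(5)]
P = [unit([rng.gauss(0.0, 1.0) for _ in range(d)]) for _ in range(7)]

q_fde = fde(Q, gaussians, is_query=True)
p_fde = fde(P, gaussians, is_query=False)
approx = dot(q_fde, p_fde)  # single inner product standing in for Chamfer(Q, P)
print(len(q_fde), approx)
```

With these assumptions the single inner product equals the sum, over query vectors, of the inner product between each query vector and the centroid of the document vectors sharing its bucket. The asymmetry (sum on one side, mean on the other) is what lets each query vector contribute once, mirroring the per-query max in Chamfer similarity. The encoding depends only on the random Gaussian vectors, not on the corpus, which is the sense in which it is data-oblivious.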

Theorems & Definitions (10)

  • Theorem 2.1: FDE Approximation
  • Theorem 2.2
  • Lemma A.1
  • proof
  • Lemma A.3: One-Sided Error Estimator
  • proof
  • proof: Proof of Theorem 2.1
  • Claim A.4
  • proof
  • proof: Proof of Theorem 2.2