MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings

Laxman Dhulipala; Majid Hadian; Rajesh Jayaram; Jason Lee; Vahab Mirrokni

TL;DR

MUVERA tackles the high cost of multi-vector (MV) retrieval by reducing Chamfer-based MV similarity to a single-vector MIPS problem through Fixed Dimensional Encodings (FDEs). The FDEs provide $\varepsilon$-approximation guarantees to Chamfer similarity and are data-oblivious, enabling efficient indexing with off-the-shelf MIPS solvers followed by a single reranking step. The approach yields robust end-to-end improvements: on BEIR benchmarks, it achieves about 10% higher recall with roughly 90% lower latency than the prior PLAID system, and it supports 32× memory savings via Product Quantization (PQ) compression. These results demonstrate that principled probabilistic partitioning and projection can bridge the gap between single- and multi-vector retrieval, offering a practical, scalable MV retrieval solution with theoretical backing.

Abstract

Neural embedding models have become a fundamental component of modern information retrieval (IR) pipelines. These models produce a single embedding $x \in \mathbb{R}^d$ per data point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT paper, multi-vector models, which produce a set of embeddings per data point, have achieved markedly superior performance for IR tasks. Unfortunately, using these models for IR is computationally expensive due to the increased complexity of multi-vector retrieval and scoring. In this paper, we introduce MUVERA (MUlti-VEctor Retrieval Algorithm), a retrieval mechanism which reduces multi-vector similarity search to single-vector similarity search. This enables the use of off-the-shelf MIPS solvers for multi-vector retrieval. MUVERA asymmetrically generates Fixed Dimensional Encodings (FDEs) of queries and documents, which are vectors whose inner product approximates multi-vector similarity. We prove that FDEs give high-quality $\varepsilon$-approximations, thus providing the first single-vector proxy for multi-vector similarity with theoretical guarantees. Empirically, we find that FDEs achieve the same recall as prior state-of-the-art heuristics while retrieving 2-5$\times$ fewer candidates. Compared to prior state-of-the-art implementations, MUVERA achieves consistently good end-to-end recall and latency across a diverse set of the BEIR retrieval datasets, achieving an average of 10% improved recall with 90% lower latency.
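To make the multi-vector similarity concrete, the following is a minimal sketch of Chamfer similarity, taken here to be the sum, over query vectors, of the best inner product with any document vector. The function names `chamfer_similarity` and `brute_force_top_k` are hypothetical and chosen for illustration; the brute-force scan is only meant to show the cost that a single-vector proxy tries to avoid, not how any system in the paper is implemented.

```python
import numpy as np

def chamfer_similarity(Q: np.ndarray, P: np.ndarray) -> float:
    """Sum over query vectors of the maximum inner product with any document vector.

    Q has shape (num_query_vectors, d); P has shape (num_doc_vectors, d).
    """
    scores = Q @ P.T                      # all pairwise inner products, shape (|Q|, |P|)
    return float(scores.max(axis=1).sum())

def brute_force_top_k(Q: np.ndarray, docs: list, k: int) -> np.ndarray:
    """Score every document exactly and return the indices of the k best (illustrative only)."""
    sims = np.array([chamfer_similarity(Q, P) for P in docs])
    return np.argsort(-sims)[:k]

# Tiny example: two query vectors, two candidate documents.
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.6, 0.8]])
doc_b = np.array([[0.0, 1.0], [1.0, 0.0]])
best = brute_force_top_k(Q, [doc_a, doc_b], k=1)
```

Each exact score costs one inner product per (query vector, document vector) pair, which is why scoring every document this way becomes expensive at corpus scale.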

Paper Structure (30 sections, 4 theorems, 28 equations, 15 figures, 4 tables)

Introduction
Contributions.
Chamfer Similarity and the Multi-Vector Retrieval Problem
Our Approach: Reducing Multi-Vector Search to Single-Vector MIPS
Related Work on Multi-Vector Retrieval
Fixed Dimensional Encodings
Theoretical Guarantees for FDEs
Evaluation
Offline Evaluation of FDE Quality
Single Vector Heuristic vs. FDE retrieval.
Variance.
Online Implementation and End-to-End Evaluation
Single-Vector MIPS Retrieval using DiskANN
Ball Carving.
Product Quantization (PQ).
...and 15 more sections

Key Result

Theorem 2.1

Fix any $\varepsilon, \delta > 0$, and sets $Q, P \subset \mathbb{R}^d$ of unit vectors, and let $m = |Q| + |P|$. Then setting $k_{\texttt{sim}} = O\left(\frac{\log (m\delta^{-1})}{\varepsilon}\right)$, $d_{\texttt{proj}} = O\left(\frac{1}{\varepsilon^2} \log \left(\frac{m}{\varepsilon\delta}\right)\right)$, $R_

Figures (15)

Figure 1: $\textsc{Muvera}$'s two-step retrieval process, compared to PLAID's multi-stage retrieval process. Diagram on the right from Santhanam et al. santhanam2022plaid, with permission.
Figure 2: FDE Generation Process. Three SimHashes ($k_{\texttt{sim}} = 3$) split space into six regions labelled $A$-$F$ (in high dimensions $B = 2^{k_{\texttt{sim}}}$, but $B = 6$ here since $d = 2$). $\mathbf{F}_{\text{q}}(Q), \mathbf{F}_{\text{doc}}(P)$ are shown as $B \times d$ matrices, where the $k$-th row is $\vec{q}_{(k)}, \vec{p}_{(k)}$. The actual FDEs are flattened versions of these matrices. Not shown: inner projections, repetitions, and fill_empty_clusters.
Figure 3: FDE recall vs. dimension for varying FDE parameters on MS MARCO. Plots show FDE Recall$@$100, 1k, and 10k, left to right. Recall$@N$ for exact Chamfer scoring is shown by dotted lines.
Figure 4: Comparison of FDE recall versus brute-force search over Chamfer similarity.
Figure 5: FDE retrieval vs SV Heuristic, both with and without document id deduplication.
...and 10 more figures
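The construction described in the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses one set of random Gaussian hyperplanes for SimHash, sums query vectors per region, averages document vectors per region (our reading of the asymmetric construction), and omits the inner projections, repetitions, and fill_empty_clusters steps that the caption lists as not shown. All function names are hypothetical.

```python
import numpy as np

def simhash_buckets(X: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Map each row of X to a bucket id in [0, 2**k_sim) from the signs of k_sim projections."""
    bits = (X @ hyperplanes.T > 0).astype(np.int64)   # shape (n, k_sim)
    weights = 1 << np.arange(hyperplanes.shape[0])    # binary place values
    return bits @ weights

def query_fde(Q: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Query side: sum the query vectors landing in each bucket, then flatten the B x d matrix."""
    B, d = 2 ** hyperplanes.shape[0], Q.shape[1]
    out = np.zeros((B, d))
    np.add.at(out, simhash_buckets(Q, hyperplanes), Q)
    return out.ravel()

def doc_fde(P: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Document side: average the document vectors in each bucket, then flatten."""
    B, d = 2 ** hyperplanes.shape[0], P.shape[1]
    out = np.zeros((B, d))
    ids = simhash_buckets(P, hyperplanes)
    np.add.at(out, ids, P)
    counts = np.bincount(ids, minlength=B)
    nonempty = counts > 0
    out[nonempty] /= counts[nonempty, None]
    return out.ravel()   # empty buckets stay zero (no fill_empty_clusters in this sketch)

rng = np.random.default_rng(0)
d, k_sim = 16, 3
hyperplanes = rng.standard_normal((k_sim, d))   # must be shared by queries and documents
Q = rng.standard_normal((8, d))
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
P = rng.standard_normal((40, d))
P /= np.linalg.norm(P, axis=1, keepdims=True)

# A single inner product between the two encodings serves as the similarity proxy.
approx_score = float(query_fde(Q, hyperplanes) @ doc_fde(P, hyperplanes))
```

Because the hyperplanes are drawn independently of the data, the encoding is data-oblivious: documents can be encoded and indexed once, and queries encoded later with the same hyperplanes.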

Theorems & Definitions (10)

Theorem 2.1: FDE Approximation
Theorem 2.2
Lemma A.1
proof
Lemma A.3: One-Sided Error Estimator
proof
proof: Proof of Theorem 2.1
Claim A.4
proof
proof: Proof of Theorem 2.2
