MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings
Laxman Dhulipala, Majid Hadian, Rajesh Jayaram, Jason Lee, Vahab Mirrokni
TL;DR
MUVERA tackles the high cost of multi-vector (MV) retrieval by reducing Chamfer-based MV similarity search to a single-vector MIPS problem through Fixed Dimensional Encodings (FDEs). FDEs provide $\varepsilon$-approximation guarantees for Chamfer similarity and are data-oblivious, so they can be indexed with off-the-shelf MIPS solvers and followed by a single reranking step. The end-to-end gains are consistent: averaged over the BEIR datasets evaluated, MUVERA achieves about 10% higher recall with roughly 90% lower latency than the prior PLAID system, and FDEs can be compressed 32× with Product Quantization (PQ) to reduce memory use. These results show that principled probabilistic partitioning and projection can bridge the gap between single- and multi-vector retrieval, giving a practical, scalable MV retrieval method with theoretical backing.
Abstract
Neural embedding models have become a fundamental component of modern information retrieval (IR) pipelines. These models produce a single embedding $x \in \mathbb{R}^d$ per data point, allowing for fast retrieval via highly optimized maximum inner product search (MIPS) algorithms. Recently, beginning with the landmark ColBERT paper, multi-vector models, which produce a set of embeddings per data point, have achieved markedly superior performance for IR tasks. Unfortunately, using these models for IR is computationally expensive due to the increased complexity of multi-vector retrieval and scoring. In this paper, we introduce MUVERA (MUlti-VEctor Retrieval Algorithm), a retrieval mechanism which reduces multi-vector similarity search to single-vector similarity search. This enables the use of off-the-shelf MIPS solvers for multi-vector retrieval. MUVERA asymmetrically generates Fixed Dimensional Encodings (FDEs) of queries and documents, which are vectors whose inner product approximates multi-vector similarity. We prove that FDEs give high-quality $\varepsilon$-approximations, thus providing the first single-vector proxy for multi-vector similarity with theoretical guarantees. Empirically, we find that FDEs achieve the same recall as prior state-of-the-art heuristics while retrieving 2-5$\times$ fewer candidates. Compared to prior state-of-the-art implementations, MUVERA achieves consistently good end-to-end recall and latency across a diverse set of BEIR retrieval datasets, achieving an average of $10\%$ improved recall with $90\%$ lower latency.
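To make the reduction concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the Chamfer similarity $\sum_{q \in Q} \max_{p \in P} \langle q, p \rangle$ directly, then builds a simplified FDE: vectors are partitioned by the sign pattern of a few random Gaussian projections (SimHash), query vectors are summed per bucket, and document vectors are averaged per bucket, which is where the asymmetry comes from. The function names (`chamfer`, `simhash_bucket`, `fde`) and all parameter values are assumptions chosen for illustration. The sketch leaves empty document buckets at zero and uses a single partition with no dimensionality-reducing projection, so it illustrates the idea and carries none of the stated guarantees.

```python
import random


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def chamfer(Q, P):
    """Chamfer similarity: each query vector is scored against its best document vector."""
    return sum(max(dot(q, p) for p in P) for q in Q)


def simhash_bucket(x, gaussians):
    """Bucket index of x: one bit per random hyperplane (sign of the projection)."""
    bucket = 0
    for g in gaussians:
        bucket = (bucket << 1) | (1 if dot(g, x) > 0 else 0)
    return bucket


def fde(vectors, gaussians, is_query):
    """Simplified fixed dimensional encoding (illustrative assumption, not the paper's code).

    Queries contribute the SUM of their vectors in each bucket; documents
    contribute the MEAN. Empty buckets are left as zeros here.
    """
    d = len(vectors[0])
    n_buckets = 1 << len(gaussians)
    sums = [[0.0] * d for _ in range(n_buckets)]
    counts = [0] * n_buckets
    for x in vectors:
        b = simhash_bucket(x, gaussians)
        counts[b] += 1
        for i, xi in enumerate(x):
            sums[b][i] += xi
    out = []
    for b in range(n_buckets):
        if is_query or counts[b] == 0:
            out.extend(sums[b])
        else:
            out.extend(s / counts[b] for s in sums[b])
    return out


# Hypothetical toy data: 4 query vectors and 10 document vectors in 8 dimensions.
rng = random.Random(0)
d, k_sim = 8, 3  # k_sim hyperplanes give 2**k_sim buckets
gaussians = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k_sim)]
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(4)]
P = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(10)]

exact = chamfer(Q, P)
approx = dot(fde(Q, gaussians, True), fde(P, gaussians, False))
print("Chamfer:", exact, "FDE inner product:", approx)
```

Because the encoding depends only on the random hyperplanes and not on the corpus, document FDEs can be computed once and handed to any single-vector MIPS index; the FDE inner product is then used to retrieve candidates, which are reranked with the exact Chamfer score.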
