Efficient Multi-Vector Dense Retrieval Using Bit Vectors
Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
TL;DR
This work tackles the high memory and latency demands of multi-vector dense retrieval by proposing EMVB, a framework that combines bit-vector pre-filtering, SIMD-accelerated column-wise centroid interaction, Product Quantization for compact storage, and a per-passage term filtering strategy during late interaction. EMVB efficiently narrows the candidate set and accelerates centroid interactions, while PQ and term filtering reduce memory and computational load without harming retrieval accuracy. Empirical results on MS MARCO and LoTTE show EMVB delivers up to 2.8x faster queries and up to 1.8x memory reduction in-domain, with up to 2.9x speedups out-of-domain and minimal quality loss, representing a meaningful advance over PLAID. The approach targets practical deployment by leveraging CPU-friendly optimizations (SIMD, AVX512) and a careful decomposition of the late interaction pipeline, enabling scalable multi-vector retrieval systems.
Abstract
Dense retrieval techniques employ pre-trained large language models to build high-dimensional representations of queries and passages. The relevance of a passage with respect to a query is then computed from these representations using efficient similarity measures. In this line of work, multi-vector representations encode queries and documents at the token level, which improves effectiveness at the cost of an order-of-magnitude increase in memory footprint and query latency. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the centroid interaction is computed column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.
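To make the first step more concrete, the following is a minimal sketch of how a bit-vector pre-filter over centroids might look. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: each centroid gets one 32-bit mask with bit i set when the centroid's similarity to query term i exceeds a threshold, and a passage is scored by OR-ing the masks of its centroids and counting the set bits. The names `build_centroid_bitmasks`, `prefilter_score`, and `prefilter`, as well as the threshold value, are hypothetical and not taken from the paper.

```python
import numpy as np


def build_centroid_bitmasks(centroid_scores: np.ndarray, threshold: float) -> np.ndarray:
    """Return one uint32 mask per centroid (assumed layout, not the paper's).

    centroid_scores has shape (n_query_terms, n_centroids).
    Bit i of mask j is set when centroid j scores above `threshold`
    for query term i. At most 32 query terms fit in a uint32 mask.
    """
    n_terms, _ = centroid_scores.shape
    if n_terms > 32:
        raise ValueError("this sketch supports at most 32 query terms")
    close = (centroid_scores > threshold).astype(np.uint32)
    bit_values = np.uint32(1) << np.arange(n_terms, dtype=np.uint32)
    # Each row contributes a distinct bit, so OR-ing rows packs them into one mask.
    return np.bitwise_or.reduce(close * bit_values[:, None], axis=0)


def prefilter_score(passage_centroid_ids, masks: np.ndarray) -> int:
    """Count how many query terms have at least one close centroid in the passage."""
    ids = np.asarray(passage_centroid_ids, dtype=np.intp)
    combined = np.bitwise_or.reduce(masks[ids])  # 0 for an empty passage
    return bin(int(combined)).count("1")  # population count


def prefilter(passages, masks: np.ndarray, top_k: int) -> list:
    """Return indices of the top_k passages by pre-filter score (ties keep input order)."""
    scores = [prefilter_score(p, masks) for p in passages]
    order = sorted(range(len(passages)), key=lambda i: -scores[i])
    return order[:top_k]


if __name__ == "__main__":
    # Two query terms, three centroids; the similarities are made-up toy values.
    scores = np.array([[0.9, 0.1, 0.2],
                       [0.1, 0.8, 0.3]])
    masks = build_centroid_bitmasks(scores, threshold=0.5)
    passages = [[0, 2], [0, 1], [2]]  # centroid ids of each passage's tokens
    print(prefilter(passages, masks, top_k=2))
```

Because the per-passage work reduces to bitwise ORs and a population count, a filter of this kind is cheap compared with full centroid interaction, which is the motivation for running it first. A production implementation would presumably use hardware popcount and SIMD rather than Python-level loops.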
