Efficient Multi-Vector Dense Retrieval Using Bit Vectors

Franco Maria Nardini, Cosimo Rulli, Rossano Venturini

TL;DR

This work tackles the high memory and latency demands of multi-vector dense retrieval by proposing EMVB, a framework that combines bit-vector pre-filtering, SIMD-accelerated column-wise centroid interaction, Product Quantization (PQ) for compact storage, and a per-passage term filtering strategy during late interaction. Bit-vector pre-filtering narrows the candidate set and the column-wise computation accelerates centroid interaction, while PQ and term filtering reduce memory and computational load. On MS MARCO (in-domain), EMVB is up to 2.8x faster than PLAID and uses up to 1.8x less memory; on LoTTE (out-of-domain), it is up to 2.9x faster with minimal quality loss. The approach targets practical deployment by relying on CPU-friendly optimizations (SIMD, AVX-512) and a careful decomposition of the late interaction pipeline.

Abstract

Dense retrieval techniques employ pre-trained large language models to build a high-dimensional representation of queries and passages. These representations are used to compute the relevance of a passage with respect to a query using efficient similarity measures. In this line, multi-vector representations show improved effectiveness at the expense of a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. Recently, PLAID has tackled these problems by introducing a centroid-based term representation to reduce the memory impact of multi-vector systems. By exploiting a centroid interaction mechanism, PLAID filters out non-relevant documents, thus reducing the cost of the successive ranking stages. This paper proposes "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB employs a highly efficient pre-filtering step of passages using optimized bit vectors. Second, the computation of the centroid interaction happens column-wise, exploiting SIMD instructions, thus reducing its latency. Third, EMVB leverages Product Quantization (PQ) to reduce the memory footprint of storing vector representations while jointly allowing for fast late interaction. Fourth, we introduce a per-document term filtering method that further improves the efficiency of the last step. Experiments on MS MARCO and LoTTE show that EMVB is up to 2.8x faster while reducing the memory footprint by 1.8x with no loss in retrieval accuracy compared to PLAID.
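
To make the bit-vector pre-filtering step more concrete, the following is a minimal sketch of one way such a filter could work. It is not the authors' implementation, which is CPU-optimized with SIMD. It assumes each passage is stored as a list of centroid ids, that a query term "matches" a centroid when their similarity exceeds a threshold th, and that a passage's pre-filter score is the number of query terms matched by at least one of its centroids. The function names (build_centroid_bitmasks, prefilter_score, prefilter) and the toy data layout are hypothetical.

```python
def build_centroid_bitmasks(centroid_scores, th):
    """Build one integer bit mask per centroid.

    centroid_scores[i][c] is the similarity between query term i and
    centroid c. Bit i of masks[c] is set when that similarity exceeds th.
    (Assumed layout; illustrative only.)
    """
    n_centroids = len(centroid_scores[0])
    masks = [0] * n_centroids
    for i, row in enumerate(centroid_scores):
        for c, score in enumerate(row):
            if score > th:
                masks[c] |= 1 << i
    return masks


def prefilter_score(passage_centroid_ids, masks):
    """OR the masks of the passage's centroids, then count the set bits.

    The result is the number of query terms that have at least one
    close centroid among the passage's centroids.
    """
    acc = 0
    for c in passage_centroid_ids:
        acc |= masks[c]
    return bin(acc).count("1")


def prefilter(passages, masks, keep):
    """Return the ids of the `keep` passages with the highest pre-filter score.

    passages maps a passage id to its list of centroid ids.
    Ties are broken by passage id to keep the output deterministic.
    """
    ranked = sorted(
        passages.items(),
        key=lambda kv: (-prefilter_score(kv[1], masks), kv[0]),
    )
    return [pid for pid, _ in ranked[:keep]]


# Toy usage: 2 query terms, 3 centroids.
centroid_scores = [
    [0.9, 0.1, 0.2],  # query term 0 vs centroids 0..2
    [0.3, 0.8, 0.1],  # query term 1 vs centroids 0..2
]
masks = build_centroid_bitmasks(centroid_scores, th=0.5)
passages = {"p1": [0, 1], "p2": [2], "p3": [0, 0]}
survivors = prefilter(passages, masks, keep=2)
```

Packing one bit per query term into a machine word is what makes the per-passage work a handful of OR operations plus a population count, which is the kind of operation that vectorizes well on CPU.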

Paper Structure

This paper contains 10 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Breakdown of the PLAID average query latency (in milliseconds) on CPU across its four phases.
  • Figure 2: R@100 with various values of the threshold (left). Comparison of different algorithms to construct $\mathtt{close}^{th}_i$, for different values of $th$ (right).
  • Figure 3: Vectorized Fast Set Membership algorithm based on bit vectors.
  • Figure 4: Vectorized vs naïve Fast Set Membership (top). Ours vs PLAID filtering (bottom).
  • Figure 5: Performance of our dynamic term-selection filtering for different values of $th_r$, in terms of percentage of original effectiveness (left) and in terms of percentage of original number of scored terms (right). The percentage of original effectiveness is computed as the ratio between the MRR@10 computed with Equation \ref{eq:ourmaxsimapprox} and Equation \ref{eq:ourmaxsim}.
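
The trade-off described in the Figure 5 caption (effectiveness retained versus number of passage terms scored) can be illustrated with a small sketch. This is a hedged approximation under stated assumptions, not the paper's equations: it assumes that, for each query term, only passage terms whose centroid score exceeds $th_r$ are scored during late interaction, and that the full set of passage terms is used as a fallback when none qualifies. The fallback rule, the function names, and the data layout are assumptions made for illustration.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query, passage):
    """Standard late interaction: for each query term take the best-matching
    passage term, then sum over query terms."""
    return sum(max(dot(q, t) for t in passage) for q in query)


def maxsim_filtered(query, passage, centroid_scores, passage_centroid_ids, th_r):
    """Late interaction restricted to passage terms that pass a centroid test.

    centroid_scores[i][c] is the similarity between query term i and
    centroid c; passage_centroid_ids[j] is the centroid of passage term j.
    Returns (score, number_of_term_pairs_scored).
    Assumption: if no passage term passes for a query term, all are scored.
    """
    total = 0.0
    scored_terms = 0
    for i, q in enumerate(query):
        kept = [
            t
            for t, c in zip(passage, passage_centroid_ids)
            if centroid_scores[i][c] > th_r
        ]
        if not kept:  # hypothetical fallback to the unfiltered computation
            kept = passage
        scored_terms += len(kept)
        total += max(dot(q, t) for t in kept)
    return total, scored_terms


# Toy usage: 2 query terms, 3 passage terms, 3 centroids.
query = [[1.0, 0.0], [0.0, 1.0]]
passage = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
passage_centroid_ids = [0, 1, 2]
centroid_scores = [[0.9, 0.1, 0.4], [0.1, 0.9, 0.4]]

full = maxsim(query, passage)
approx, n_scored = maxsim_filtered(
    query, passage, centroid_scores, passage_centroid_ids, th_r=0.5
)
```

In this toy case the filtered score equals the full score while scoring 2 term pairs instead of 6; in general the filtered score is an approximation, which is why the caption reports effectiveness as a percentage of the original.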