ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring

Kaili Huang, Thejas Venkatesh, Uma Dingankar, Antonio Mallia, Daniel Campos, Jian Jiao, Christopher Potts, Matei Zaharia, Kwabena Boahen, Omar Khattab, Saarthak Sarup, Keshav Santhanam

TL;DR

This work tackles the challenge of serving high-quality neural IR models, such as ColBERTv2, under tight RAM budgets and high concurrency by introducing ColBERT-serve. It combines a memory-mapped index, a multi-stage retrieval pipeline with SPLADEv2 as the first stage, and a hybrid scoring mechanism to preserve retrieval quality while dramatically reducing memory usage. The system demonstrates up to 4 queries per second on machines with only a few gigabytes of RAM and achieves about a 90% reduction in RAM compared to full ColBERTv2, with quality maintained or improved through the hybrid scoring strategy. The approach enables cost-effective, scalable deployment of late-interaction neural IR over large collections, addressing the practical gap between latency, memory, and accuracy in real-world serving scenarios, and provides a benchmark methodology for concurrent neural IR under memory budgets.

Abstract

We study serving retrieval models, specifically late interaction models like ColBERT, to many concurrent users under a small budget, in which the index may not fit in memory. We present ColBERT-serve, a novel serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting its deployment on cheap servers. It also incorporates a multi-stage architecture with hybrid scoring, which reduces ColBERT's query latency and supports many concurrent queries in parallel.
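The three ingredients named above (a memory-mapped index, a first-stage candidate list that is then rescored, and a hybrid score) can be illustrated with a minimal sketch. This is not the ColBERT-serve implementation: the flat float32 file layout, the names `MmapIndex`, `rerank_hybrid`, and `alpha`, and the min-max normalization followed by linear interpolation are all assumptions made for illustration. The real ColBERTv2 index stores compressed embeddings, and the first-stage scores here are made-up numbers standing in for a sparse retriever such as SPLADEv2. Only the MaxSim late-interaction score follows ColBERT's published definition.

```python
import mmap
import os
import struct
import tempfile

DIM = 4             # toy embedding dimension (assumption)
TOKENS_PER_DOC = 3  # fixed token count per document (assumption, keeps offsets simple)


def write_toy_index(path, docs):
    """Write documents as a flat little-endian float32 file (hypothetical layout)."""
    with open(path, "wb") as f:
        for doc in docs:
            for vec in doc:
                f.write(struct.pack(f"<{DIM}f", *vec))


class MmapIndex:
    """Reads token embeddings through a memory map instead of loading the file."""

    def __init__(self, path):
        self._file = open(path, "rb")
        # The OS pages data in on access, so resident memory tracks the
        # documents actually touched rather than the full index size.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._doc_bytes = TOKENS_PER_DOC * DIM * 4

    def doc(self, doc_id):
        flat = struct.unpack_from(
            f"<{TOKENS_PER_DOC * DIM}f", self._map, doc_id * self._doc_bytes
        )
        return [flat[i:i + DIM] for i in range(0, len(flat), DIM)]

    def close(self):
        self._map.close()
        self._file.close()


def maxsim(query, doc):
    """Late interaction: sum over query tokens of the best-matching doc token."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc) for qv in query
    )


def min_max(scores):
    """Rescale scores to [0, 1] so the two stages are comparable (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def rerank_hybrid(index, query, candidates, first_stage_scores, alpha=0.5):
    """Rescore first-stage candidates and interpolate the two normalized scores."""
    late = min_max([maxsim(query, index.doc(d)) for d in candidates])
    first = min_max(list(first_stage_scores))
    hybrid = [alpha * l + (1 - alpha) * f for l, f in zip(late, first)]
    return sorted(zip(candidates, hybrid), key=lambda pair: -pair[1])


def demo(alpha=1.0):
    """Build a three-document toy index on disk and rerank a fixed candidate list."""
    docs = [
        [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
        [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]],
        [[0.5, 0.5, 0, 0], [0, 0, 0.5, 0.5], [0, 0, 0, 0]],
    ]
    query = [[1, 0, 0, 0], [0, 1, 0, 0]]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.idx")
        write_toy_index(path, docs)
        index = MmapIndex(path)
        try:
            # First-stage scores are invented stand-ins for a sparse retriever.
            return rerank_hybrid(index, query, [0, 1, 2], [1.0, 3.0, 2.0], alpha)
        finally:
            index.close()
```

Calling `demo(alpha=1.0)` ranks purely by the late-interaction score, while `demo(alpha=0.0)` reproduces the first-stage order. Intermediate values of the hypothetical `alpha` trade the two off. Because only the candidate documents are read through the map, the rest of the file never has to be resident in RAM.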

Paper Structure

This paper contains 12 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: P95 Latency on Wikipedia Dataset.
  • Figure 2: P95 Latency on MS MARCO and LoTTE. Note that full ColBERTv2 on MS MARCO is evaluated on a higher-end and more expensive machine (refer to Table \ref{table:specs}) with a different physical processor, so its latency is only for reference and is not directly comparable to the MMAP methods.