ColBERT-serve: Efficient Multi-Stage Memory-Mapped Scoring
Kaili Huang, Thejas Venkatesh, Uma Dingankar, Antonio Mallia, Daniel Campos, Jian Jiao, Christopher Potts, Matei Zaharia, Kwabena Boahen, Omar Khattab, Saarthak Sarup, Keshav Santhanam
TL;DR
This work tackles the challenge of serving high-quality neural IR models, such as ColBERTv2, under tight RAM budgets and high concurrency by introducing ColBERT-serve. It combines a memory-mapped index, a multi-stage retrieval pipeline with SPLADEv2 as the first stage, and a hybrid scoring mechanism to preserve retrieval quality while dramatically reducing memory usage. The system demonstrates up to 4 queries per second on machines with only a few gigabytes of RAM and achieves about a 90% reduction in RAM compared to full ColBERTv2, with quality maintained or improved through the hybrid scoring strategy. The approach enables cost-effective, scalable deployment of late-interaction neural IR over large collections, addressing the practical gap between latency, memory, and accuracy in real-world serving scenarios, and provides a benchmark methodology for concurrent neural IR under memory budgets.
Abstract
We study serving retrieval models, specifically late interaction models like ColBERT, to many concurrent users under a small budget, in which the index may not fit in memory. We present ColBERT-serve, a novel serving system that applies a memory-mapping strategy to the ColBERT index, reducing RAM usage by 90% and permitting its deployment on cheap servers. It also incorporates a multi-stage architecture with hybrid scoring, which reduces ColBERT's query latency and supports many concurrent queries.
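To make the two ideas above concrete, here is a minimal sketch of rescoring first-stage candidates against a memory-mapped index and combining the two scores. It is not the ColBERT-serve implementation: the fixed-size float32 record layout, the min-max normalization, the linear interpolation weight `alpha`, and all function names are assumptions chosen for illustration, and the first-stage scores stand in for the output of a sparse retriever such as SPLADEv2.

```python
import mmap
import os
import struct
import tempfile

DIM = 4             # toy embedding width (assumption; real indexes are compressed)
TOKENS_PER_DOC = 2  # fixed-length docs keep the offset arithmetic trivial
DOC_BYTES = DIM * TOKENS_PER_DOC * 4  # 4 bytes per float32


def write_toy_index(path, docs):
    """Write each doc's token vectors as contiguous little-endian float32 values."""
    with open(path, "wb") as f:
        for doc in docs:
            for vec in doc:
                f.write(struct.pack(f"<{DIM}f", *vec))


def read_doc(mm, doc_id):
    """Read one doc's token vectors from the mapping.

    Only the pages backing this byte range need to be resident in RAM.
    """
    base = doc_id * DOC_BYTES
    return [
        struct.unpack_from(f"<{DIM}f", mm, base + t * DIM * 4)
        for t in range(TOKENS_PER_DOC)
    ]


def maxsim(query_vecs, doc_vecs):
    """Late interaction: for each query token, take the best dot product, then sum."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so the two stages are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in scores.items()}


def hybrid_rerank(mm, query_vecs, first_stage, alpha=0.3):
    """Rescore first-stage candidates and interpolate the normalized scores.

    alpha weights the first-stage score (hypothetical parameter).
    """
    late = {d: maxsim(query_vecs, read_doc(mm, d)) for d in first_stage}
    a, b = min_max(first_stage), min_max(late)
    combined = {d: alpha * a[d] + (1 - alpha) * b[d] for d in first_stage}
    return sorted(combined, key=combined.get, reverse=True)


def demo(alpha=0.3):
    """Build a three-document toy index on disk, map it, and rerank two candidates."""
    docs = [
        [(1, 0, 0, 0), (0, 1, 0, 0)],
        [(0, 0, 1, 0), (0, 0, 0, 1)],
        [(1, 0, 0, 0), (0, 0, 1, 0)],
    ]
    query = [(1, 0, 0, 0), (0, 0, 1, 0)]
    first_stage = {0: 2.0, 2: 1.5}  # made-up sparse scores for docs 0 and 2
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.idx")
        write_toy_index(path, docs)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return hybrid_rerank(mm, query, first_stage, alpha)
```

In this sketch, document 1 is never read because the first stage did not return it. That is the property a memory-mapped index relies on: only the candidates that reach the rescoring stage have their pages loaded. With `alpha=0.3` the late-interaction score dominates and document 2 ranks first; with `alpha=1.0` the ranking falls back to the first-stage order.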
