PLAID: An Efficient Engine for Late Interaction Retrieval

Keshav Santhanam, Omar Khattab, Christopher Potts, Matei Zaharia

TL;DR

The paper tackles the high latency of late-interaction neural IR models, particularly ColBERTv2, by designing PLAID, an engine that rapidly filters candidate passages through centroid-based mechanisms. It introduces centroid interaction and centroid pruning, which replace costly early-stage residual decompression for the majority of candidates, so that full scoring runs only on a small, high-quality set. Empirical evaluations across MS MARCO, Wikipedia, LoTTE, and MS MARCO v2 demonstrate substantial end-to-end speedups on both GPU (2.5–7×) and CPU (9–45×) with little to no loss in retrieval quality, at scales of up to 140 million passages. The work also provides optimized kernels for padding-free MaxSim and decompression, supporting practical deployment and setting a new baseline for efficient late-interaction retrieval.

Abstract

Pre-trained language models are increasingly important components across multiple information retrieval (IR) paradigms. Late interaction, introduced with the ColBERT model and recently refined in ColBERTv2, is a popular paradigm that holds state-of-the-art status across many benchmarks. To dramatically speed up the search latency of late interaction, we introduce the Performance-optimized Late Interaction Driver (PLAID). Without impacting quality, PLAID swiftly eliminates low-scoring passages using a novel centroid interaction mechanism that treats every passage as a lightweight bag of centroids. PLAID uses centroid interaction as well as centroid pruning, a mechanism for sparsifying the bag of centroids, within a highly-optimized engine to reduce late interaction search latency by up to 7$\times$ on a GPU and 45$\times$ on a CPU against vanilla ColBERTv2, while continuing to deliver state-of-the-art retrieval quality. This allows the PLAID engine with ColBERTv2 to achieve latency of tens of milliseconds on a GPU and tens or just a few hundred milliseconds on a CPU, even at the largest scale we evaluate, with 140M passages.
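
To make the "bag of centroids" idea concrete, the following is a minimal sketch, not the PLAID implementation. It contrasts exact late-interaction (MaxSim) scoring with a centroid-only approximation. The function names, the pure-Python vector representation, and the pruning rule (drop a centroid whose best similarity to any query token falls below a threshold) are assumptions made for illustration. The abstract only states that pruning sparsifies the bag of centroids.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query: List[Vector], passage: List[Vector]) -> float:
    """Late interaction: for each query token, take its best-matching
    passage token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, d) for d in passage) for q in query)


def centroid_interaction_score(
    query: List[Vector],
    centroids: List[Vector],
    passage_centroid_ids: List[int],
    prune_threshold: Optional[float] = None,
) -> float:
    """Approximate MaxSim using only the centroid ids assigned to the
    passage's tokens, so no residual decompression is needed."""
    bag = set(passage_centroid_ids)  # the passage as a bag of centroids
    if prune_threshold is not None:
        # Assumed pruning rule: keep a centroid only if it is similar
        # enough to at least one query token.
        bag = {
            c for c in bag
            if max(dot(q, centroids[c]) for q in query) >= prune_threshold
        }
    if not bag:
        return 0.0  # assumption: a fully pruned passage scores zero
    return sum(max(dot(q, centroids[c]) for c in bag) for q in query)
```

In this sketch the query-to-centroid similarities depend only on the query, so a real engine could compute them once per query and reuse them for every candidate passage. The sketch recomputes them for clarity.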

Paper Structure (26 sections, 5 equations, 8 figures, 6 tables)

This paper contains 26 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The late interaction architecture, given a query and a passage. Diagram from Khattab et al. (2021) with permission.
  • Figure 2: Latency breakdown of MS MARCO v1 dev queries run with vanilla ColBERTv2 and PLAID ColBERTv2 on a TITAN V GPU. Vanilla ColBERTv2 is overwhelmingly bottlenecked by the cost of index lookup and decompression, a challenge that PLAID addresses.
  • Figure 3: Recall of passages retrieved by a centroid-only version of ColBERTv2 with respect to the top $k$ passages retrieved by vanilla ColBERTv2. Centroids alone can identify virtually all of the top-$k$ passages retrieved with the full ColBERTv2 pipeline, within $10 \cdot k$ or fewer candidates, motivating our centroid interaction strategy.
  • Figure 4: Centroid score distribution for each query among a random sample of 15 MS MARCO v1 dev queries evaluated with ColBERTv2.
  • Figure 5: The PLAID scoring pipeline. The first stage generates an initial set of candidate passages using the centroids. Next the second and third stages leverage centroid pruning and centroid interaction respectively to refine the candidate set. Then the last stage performs full residual decompression to obtain the final passage ranking. We use the hyperparameter ndocs to specify the number of candidates returned by Stage 2, and in our experiments we have Stage 3 output $\frac{\texttt{ndocs}}{4}$ passages.
  • ...and 3 more figures
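
The candidate funnel described in the Figure 5 caption, and the recall measure plotted in Figure 3, can be sketched as follows. This is an illustrative outline under stated assumptions, not the PLAID code. The scoring callables (`pruned_score`, `centroid_score`, `full_score`) are hypothetical stand-ins for the stage scorers. Integer division for `ndocs / 4` and the tie-breaking rule are also assumptions.

```python
from typing import Callable, Dict, Iterable, List


def top_n(scores: Dict[int, float], n: int) -> List[int]:
    """Return the n passage ids with the highest scores
    (ties broken by the smaller passage id)."""
    return sorted(scores, key=lambda pid: (-scores[pid], pid))[:n]


def staged_rank(
    candidates: Iterable[int],
    pruned_score: Callable[[int], float],    # Stage 2: centroid pruning
    centroid_score: Callable[[int], float],  # Stage 3: centroid interaction
    full_score: Callable[[int], float],      # Stage 4: full decompression
    ndocs: int,
    k: int,
) -> List[int]:
    """Narrow Stage 1 candidates to ndocs, then ndocs // 4, then the top k."""
    stage2 = top_n({p: pruned_score(p) for p in candidates}, ndocs)
    stage3 = top_n({p: centroid_score(p) for p in stage2}, ndocs // 4)
    return top_n({p: full_score(p) for p in stage3}, k)


def recall_wrt_reference(
    candidates: Iterable[int], reference_top_k: Iterable[int]
) -> float:
    """Fraction of the reference top-k (e.g. from the full pipeline)
    that also appears among the candidates."""
    reference = set(reference_top_k)
    if not reference:
        raise ValueError("reference_top_k must not be empty")
    return len(reference & set(candidates)) / len(reference)
```

Only the `ndocs // 4` passages that survive Stage 3 reach the expensive `full_score` call. That matches the caption's point that full residual decompression is reserved for the last stage.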