Towards Lossless Token Pruning in Late-Interaction Retrieval Models
Yuxuan Zong, Benjamin Piwowarski
TL;DR
This work tackles the memory bottleneck of late-interaction retrieval through lossless token pruning, proposing a principled dominance framework that removes document tokens while preserving ColBERT scores. By adapting ColBERT to ColBERT_P with a projection and ReLU, and showing that testing dominance is equivalent to solving a linear programming problem, the authors enable lossless pruning of a large fraction of document tokens. They introduce three regularizations (nuclear norm, document similarity, and L1) and two pruning strategies (LP-based with reduced dimension, and norm-based) to maximize pruning while retaining retrieval quality, achieving up to ~70% token removal with minimal in-domain losses and strong out-of-domain performance. The approach yields practical gains in index efficiency and performs robustly across MS MARCO, BEIR, and LoTTE, with an interpretability analysis suggesting pruned tokens align with semantically salient content.
Abstract
Late-interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a large amount of memory to store the contextual representations of all document tokens. Some works have proposed pruning tokens from each document using either heuristics or statistics-based techniques. This, however, does not guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in-domain and out-of-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.
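To make "pruning without impacting the score" concrete, the following is a minimal sketch and not the authors' method. It assumes that both query and document token vectors are non-negative (as a ReLU output would be) and uses a simplified pairwise test: a document token that is component-wise no larger than another kept token can never win the max in the late-interaction score, so removing it leaves the score unchanged. The paper's dominance criterion is more general and is checked with a linear program; the function names `maxsim_score` and `prune_pairwise_dominated` are hypothetical.

```python
import numpy as np


def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query token, take the best-matching
    document token (max dot product), then sum over query tokens."""
    sims = query_vecs @ doc_vecs.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())


def prune_pairwise_dominated(doc_vecs: np.ndarray):
    """Drop every token that is component-wise <= some token that is still kept.

    Simplified sufficient condition (an assumption of this sketch): it is only
    lossless when all query vectors are non-negative.
    """
    n = doc_vecs.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j or not keep[j]:
                continue
            if np.all(doc_vecs[i] <= doc_vecs[j]):
                keep[i] = False  # token j scores at least as high for any q >= 0
                break
    return doc_vecs[keep], keep


# Toy document: tokens 1 and 3 are dominated by token 0.
doc = np.array([[1.0, 2.0],
                [0.5, 1.0],
                [2.0, 0.1],
                [0.0, 0.0]])
queries = np.array([[1.0, 0.0],
                    [0.3, 0.7]])

pruned_doc, keep_mask = prune_pairwise_dominated(doc)
full_score = maxsim_score(queries, doc)
pruned_score = maxsim_score(queries, pruned_doc)
```

On this toy input, half of the document tokens are removed and the two scores coincide. The pairwise test misses tokens that are dominated only by a combination of other tokens, which is the case the LP formulation is meant to capture.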
