Towards Lossless Token Pruning in Late-Interaction Retrieval Models

Yuxuan Zong, Benjamin Piwowarski

TL;DR

This work tackles the memory bottleneck of late-interaction retrieval through lossless token pruning, proposing a principled dominance framework that removes tokens while preserving ColBERT scores. By adapting ColBERT to ColBERT_P with a projection and ReLU, and by showing that checking dominance is equivalent to solving a linear programming problem, the authors enable lossless pruning of a large fraction of document tokens. They introduce three regularizations (nuclear norm, document similarity, and L1) and two pruning strategies (LP-based with reduced dimension, and norm-based) to maximize pruning while retaining retrieval quality, removing roughly 70% of tokens with minimal in-domain losses and strong out-of-domain performance. The approach yields practical gains in index efficiency and performs robustly across MS MARCO, BEIR, and LoTTE, with an interpretability analysis suggesting pruned tokens align with semantically salient content.

Abstract

Late-interaction neural IR models like ColBERT offer a competitive effectiveness-efficiency trade-off across many benchmarks. However, they require a large amount of memory to store the contextual representations of all document tokens. Some works have proposed using either heuristics or statistics-based techniques to prune tokens from each document. This, however, does not guarantee that the removed tokens have no impact on the retrieval score. Our work uses a principled approach to define how to prune tokens without impacting the score between a document and a query. We introduce three regularization losses that induce a solution with high pruning ratios, as well as two pruning strategies. We study them experimentally (in-domain and out-of-domain), showing that we can preserve ColBERT's performance while using only 30% of the tokens.

Paper Structure

This paper contains 36 sections, 2 theorems, 18 equations, 2 figures, 5 tables.

Key Result

Lemma 1

Let $\mathbf{D}^+$ and $\mathbf{D}^-$ be a partition of the set of document vectors $\mathbf{D}\subset\mathbb{R}^d$ such that every document vector $\mathbf{d} \in \mathbf{D}^-$ is dominated by $\mathbf{D}$. Then every document vector $\mathbf{d}^- \in \mathbf{D}^-$ is also dominated by $\mathbf{D}^+$.
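The practical consequence of dominance can be illustrated with a small sketch. It assumes the standard ColBERT late-interaction score (for each query vector, take the maximum inner product over document vectors, then sum) and uses one simple sufficient condition for dominance: a document vector that is a convex combination of other document vectors can never win the maximum on its own. The helper names (`dot`, `maxsim_score`) and the toy 2-D vectors are hypothetical and are not taken from the paper, whose exact definition of dominance and modified scoring function may differ.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the best inner product."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D document (assumed values): d3 is the midpoint of d1 and d2,
# i.e. a convex combination, so q.d3 <= max(q.d1, q.d2) for every q.
d1, d2 = (2.0, 0.0), (0.0, 2.0)
d3 = (1.0, 1.0)
full_doc = [d1, d2, d3]
pruned_doc = [d1, d2]  # d3 removed

# Compare scores on random integer-valued queries (4 query vectors each).
rng = random.Random(0)
queries = [
    [(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(4)]
    for _ in range(1000)
]
max_gap = max(
    abs(maxsim_score(q, full_doc) - maxsim_score(q, pruned_doc))
    for q in queries
)
print("largest score change after pruning d3:", max_gap)

# By contrast, d2 is not dominated: dropping it changes the score for q = (0, 1).
print(maxsim_score([(0.0, 1.0)], full_doc), maxsim_score([(0.0, 1.0)], [d1, d3]))
```

Sampling queries only illustrates the property; it does not prove it. Establishing dominance for all possible query vectors is what the linear programming formulation mentioned in the TL;DR is for.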

Figures (2)

  • Figure 1: Illustration of the concept of dominance. In this example, $\mathbf{d}_3$ is dominated by $\mathbf{d}_1$ and $\mathbf{d}_2$. $\mathbf{d}_3$ can be removed from the document representation without changing a modified ColBERT scoring function (see Section \ref{sec:lp}). Despite its low norm, $\mathbf{d}_2$ is kept since it brings new information about the relevance of the document. The hatched area corresponds to the half-space where the inner product of any vector with $\mathbf{d}_3$ is negative. The gray area is discussed in Section \ref{sec:adapting-colbert}.
  • Figure 2: Average TREC DL (2019 and 2020) nDCG@10 for different regularizations and pruning ratios

Theorems & Definitions (3)

  • Definition 1: Local dominance
  • Lemma 1: Global dominance
  • Lemma 2: Farkas' Lemma
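
For reference, Farkas' Lemma in its common textbook form is stated below. The symbols $A$, $\mathbf{b}$, $\mathbf{x}$, and $\mathbf{y}$ are generic and not taken from the paper, whose variant and notation may differ. The lemma is the usual bridge between a "for all directions" condition and the feasibility of a linear system, which is the kind of equivalence a linear programming formulation of dominance relies on.

```latex
% Farkas' Lemma (standard form; notation is generic, not the paper's).
Let $A \in \mathbb{R}^{m \times n}$ and $\mathbf{b} \in \mathbb{R}^{m}$.
Then exactly one of the following two systems has a solution:
\begin{enumerate}
  \item $A\mathbf{x} = \mathbf{b}$ and $\mathbf{x} \geq \mathbf{0}$
        for some $\mathbf{x} \in \mathbb{R}^{n}$;
  \item $A^{\top}\mathbf{y} \geq \mathbf{0}$ and $\mathbf{b}^{\top}\mathbf{y} < 0$
        for some $\mathbf{y} \in \mathbb{R}^{m}$.
\end{enumerate}
```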