A Voronoi Cell Formulation for Principled Token Pruning in Late-Interaction Retrieval Models

Yash Kankanampati; Yuxuan Zong; Nadi Tomeh; Benjamin Piwowarski; Joseph Le Roux

TL;DR

This work introduces a framework grounded in hyperspace geometry that casts token pruning as a Voronoi cell estimation problem in the embedding space. It demonstrates that this approach serves not only as a competitive pruning strategy but also as a valuable tool for improving and interpreting token-level behavior within dense retrieval systems.

Abstract

Late-interaction models like ColBERT offer competitive performance across various retrieval tasks but require storing a dense embedding for each document token, which leads to substantial index storage overhead. Prior work addresses this by pruning low-importance token embeddings based on statistical and empirical measures, but these methods often either lack formal grounding or are ineffective. To address these shortcomings, we introduce a framework grounded in hyperspace geometry and cast token pruning as a Voronoi cell estimation problem in the embedding space. By interpreting each token's influence as a measure of its Voronoi region, our approach enables principled pruning that retains retrieval quality while reducing index size. Through our experiments, we demonstrate that this approach serves not only as a competitive pruning strategy but also as a valuable tool for improving and interpreting token-level behavior within dense retrieval systems.
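To make the Voronoi reading of token influence concrete, the following is a minimal sketch, not the authors' algorithm. It assumes queries drawn uniformly from the unit sphere (the paper's actual query distribution may differ), and the function names `sample_unit_queries`, `voronoi_stats`, and `prune_tokens` are hypothetical. For each document token it estimates, by Monte Carlo sampling, the fraction of queries for which that token attains the maximum dot product (the mass of its Voronoi region) and the expected drop in the max-dot-product score if the token were removed. It then greedily removes the token with the smallest estimated error, re-estimating after each removal.

```python
import numpy as np

def sample_unit_queries(n, dim, rng):
    # Assumption: queries are uniform on the unit sphere.
    q = rng.standard_normal((n, dim))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

def voronoi_stats(doc, queries):
    """doc: (T, dim) token vectors with T >= 2; queries: (N, dim).

    Returns per-token (cell_mass, expected_error):
      cell_mass[i]      ~ fraction of queries whose max dot product is token i
      expected_error[i] ~ mean score drop if token i alone were removed
    """
    n, t = queries.shape[0], doc.shape[0]
    scores = queries @ doc.T                      # (N, T) dot products
    order = np.argsort(scores, axis=1)            # ascending per query
    rows = np.arange(n)
    best = order[:, -1]                           # winning token per query
    gap = scores[rows, best] - scores[rows, order[:, -2]]  # top minus runner-up
    cell_mass = np.bincount(best, minlength=t) / n
    expected_error = np.bincount(best, weights=gap, minlength=t) / n
    return cell_mass, expected_error

def prune_tokens(doc, queries, keep):
    """Greedily drop the token with the smallest estimated error until
    `keep` tokens remain (keep >= 1). Returns indices of retained tokens."""
    kept = np.arange(doc.shape[0])
    while kept.size > keep:
        _, err = voronoi_stats(doc[kept], queries)
        kept = np.delete(kept, np.argmin(err))
    return kept
```

A token lying strictly inside the convex hull of the other tokens never attains the maximum dot product, so its estimated region mass and removal error are both zero and it is removed first under this sketch.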

Paper Structure

This paper contains 36 sections, 10 equations, 6 figures, 3 tables, and 1 algorithm.

Introduction
Related Work
Late-interaction Retrieval
Index Compression
Index Pruning
Problem Formulation
Methodology
Preliminaries
Voronoi Pruning Algorithm
Monte Carlo Estimation of Expected Error
Iterative Pruning
Global Pruning
Improving Greedy Pruning
Experimental Setup
Implementation Details
...and 21 more sections

Figures (6)

Figure 1: Analysis of query embeddings from MS MARCO using ColBERT.
Figure 2: An example illustrating the difference between iterative and non-iterative Voronoi pruning of 2D document vectors. Each subfigure shows the maximum dot-product Voronoi regions and the retained document vectors.
Figure 3: Performance of LP Pruning and Voronoi Pruning on document vectors produced by ColBERT finetuned with the docsim regularizer (with $\alpha=0.8$).
Figure 4: Distributions showing how often document tokens contribute to max-dot product scores (in blue) and the aggregated mean errors over the position of tokens (in orange).
Figure 5: Distribution showing when a token at a particular position is pruned, relative to other tokens in the document. Lower values signify earlier pruning.
...and 1 more figures
