ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala

TL;DR

This work evaluates whether pre-training ColBERT multi-vector models is advantageous over relying solely on knowledge distillation on top of a dense model. It demonstrates that a fully ColBERT-pre-trained model, ColBERT-Zero, trained on public data, can outperform state-of-the-art baselines that use stronger but closed data, establishing new performance standards for its size. A supervised contrastive phase before KD can closely approximate full pre-training at a fraction of the cost, offering a practical alternative when large-scale unsupervised pre-training is prohibitive. The study also highlights the critical role of aligning pre-training and fine-tuning setups, particularly regarding prompts, and provides public checkpoints and code to facilitate further exploration of multi-vector pre-training techniques.

Abstract

Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting a new state of the art for models of this size. We also find that, although a small KD step alone is not enough to reach results close to full pre-training, adding a supervised step beforehand brings performance much closer while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as the code used to train them.

Paper Structure (9 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: An illustration of the three training phases. Unsupervised contrastive pre-training is a very large-scale training phase that relies exclusively on large batch sizes to supply in-batch negatives. Supervised contrastive fine-tuning is a refinement step that leverages mined hard negatives to provide a stronger signal. The knowledge distillation step uses a strong teacher to score various documents and uses KL divergence to make the student's score distribution fit the teacher's.
  • Figure 2: The different training pipelines compared in this work. KD only is the most common setup, while ColBERT-Zero is the most expensive. Supervised + KD is an intermediate option that is much cheaper than full pre-training.
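
The two objectives named in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code. The function names, the `temperature` parameter, and the KL direction (teacher relative to student) are all assumptions made for the example. Scores are taken as already computed, so the sketch is agnostic to how a multi-vector model produces them.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def in_batch_contrastive_loss(sim, temperature=1.0):
    """Contrastive (InfoNCE-style) loss with in-batch negatives.

    sim[i][j] is the score of query i against document j. The diagonal
    holds the positive pairs; every other document in the batch acts as
    a negative for that query. `temperature` is a hypothetical knob.
    """
    total = 0.0
    for i, row in enumerate(sim):
        log_probs = log_softmax([s / temperature for s in row])
        total -= log_probs[i]  # negative log-likelihood of the positive
    return total / len(sim)


def kd_kl_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the candidate documents of one query.

    Both score lists are turned into distributions with a softmax.
    The direction of the divergence is an assumption of this sketch.
    """
    log_p_teacher = log_softmax(teacher_scores)
    log_p_student = log_softmax(student_scores)
    return sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
```

With uniform scores the contrastive loss equals the log of the batch size, which is why larger batches give a harder task and a richer signal in the unsupervised phase. The distillation loss is zero when the student's scores already match the teacher's and grows as the two distributions diverge. Supervised fine-tuning with mined hard negatives can reuse the same contrastive form by appending the hard negatives as extra columns for each query.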