Exploring the Effectiveness of Multi-stage Fine-tuning for Cross-encoder Re-rankers
Francesca Pezzuti, Sean MacAvaney, Nicola Tonellotto
TL;DR
This paper examines whether multi-stage fine-tuning improves cross-encoder re-rankers over single-stage training. It compares point-wise cross-encoders fine-tuned with a contrastive loss (LCE) against those fine-tuned with a distillation loss (RankNet), and evaluates both two-stage sequences (contrastive then distillation, C→D, and distillation then contrastive, D→C) on MS MARCO and related benchmarks. The results indicate that the contrastive approach generally outperforms distillation, and that adding a second fine-tuning stage does not yield statistically significant improvements over a single stage. The findings suggest that a well-constructed single-stage fine-tuning regime is sufficient for effective cross-encoder re-ranking. Code is available for replication.
Abstract
State-of-the-art cross-encoders can be fine-tuned to be highly effective in passage re-ranking. The typical fine-tuning process of cross-encoders as re-rankers requires large amounts of manually labelled data, a contrastive learning objective, and a set of heuristically sampled negatives. An alternative recent approach for fine-tuning instead involves teaching the model to mimic the rankings of a highly effective large language model using a distillation objective. These fine-tuning strategies can be applied either individually, or in sequence. In this work, we systematically investigate the effectiveness of point-wise cross-encoders when fine-tuned independently in a single stage, or sequentially in two stages. Our experiments show that the effectiveness of point-wise cross-encoders fine-tuned using contrastive learning is indeed on par with that of models fine-tuned with multi-stage approaches. Code is available for reproduction at https://github.com/fpezzuti/multistage-finetuning.
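To make the two objectives concrete, the following is a minimal sketch of a contrastive (LCE-style) loss and a RankNet-style distillation loss for a single query, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names `lce_loss` and `ranknet_loss`, the list-based inputs, and the averaging over pairs are all hypothetical choices made here for clarity.

```python
import math


def lce_loss(pos_score, neg_scores):
    """Contrastive (LCE-style) loss for one query.

    Softmax cross-entropy over the scores of one relevant passage and
    its sampled negatives, with the relevant passage as the target.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


def ranknet_loss(student_scores, teacher_ranking):
    """RankNet-style distillation loss for one query.

    teacher_ranking lists passage indices from best to worst, as ordered
    by the teacher (for example, a large language model). Every pair in
    which the teacher prefers passage i over passage j contributes
    log(1 + exp(-(s_i - s_j))); the result is averaged over pairs
    (the averaging is an assumption of this sketch).
    """
    total, n_pairs = 0.0, 0
    for a in range(len(teacher_ranking)):
        for b in range(a + 1, len(teacher_ranking)):
            i, j = teacher_ranking[a], teacher_ranking[b]
            diff = student_scores[i] - student_scores[j]
            # Numerically stable softplus(-diff)
            total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0
```

Under this sketch, a two-stage C→D schedule would train first with `lce_loss` on labelled data and then continue training the same model with `ranknet_loss` on teacher rankings; D→C reverses the order.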
