Rank-DistiLLM: Closing the Effectiveness Gap Between Cross-Encoders and LLMs for Passage Re-Ranking

Ferdinand Schlatt, Maik Fröbe, Harrisen Scells, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Benno Stein, Martin Potthast, Matthias Hagen

TL;DR

Rank-DistiLLM targets the effectiveness gap between distilled cross-encoders and their teacher LLMs in passage re-ranking by systematically analyzing fine-tuning practices, ranking depth, and data quality. It introduces the Rank-DistiLLM distillation dataset and a novel listwise loss, ADR-MSE, and shows that cross-encoders fine-tuned on Rank-DistiLLM can match LLM effectiveness while being up to 173 times faster and 24 times more memory efficient. The method combines MS MARCO-based pretraining with LLM-distillation data (RankGPT+, RankZephyr) and is evaluated in both in-domain and out-of-domain settings, where it outperforms prior distillation datasets such as RankGPT and TWOLAR. The resulting re-rankers are accurate enough, and cheap enough in compute and memory, to be feasible for production-scale search. The work contributes a dataset and a distillation recipe that bring LLM-level re-ranking closer to real-world deployment.

Abstract

Cross-encoders distilled from large language models (LLMs) are often more effective re-rankers than cross-encoders fine-tuned on manually labeled data. However, distilled models do not match the effectiveness of their teacher LLMs. We hypothesize that this effectiveness gap arises because previous work has not applied the best-suited methods for fine-tuning cross-encoders on manually labeled data (e.g., hard-negative sampling, deep sampling, and listwise loss functions). To close this gap, we create a new dataset, Rank-DistiLLM. Cross-encoders trained on Rank-DistiLLM achieve the effectiveness of LLMs while being up to 173 times faster and 24 times more memory efficient. Our code and data are available at https://github.com/webis-de/ECIR-25.

Paper Structure (20 sections, 3 equations, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Average effectiveness on TREC DL 2019 and 2020 for models fine-tuned on subsamples of Rank-DistiLLM using different depths and numbers of samples.