Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning
Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
TL;DR
Rank-R1 uses reinforcement learning to add reasoning to an LLM-based Setwise reranker, so that the model reasons over a query and its candidate documents before selecting the most relevant one. Trained with GRPO, Rank-R1 needs only a small set of relevance labels and a rule-based reward that scores the final selected label against ground-truth relevance; no human-annotated reasoning data is required. In-domain, Rank-R1 matches supervised fine-tuning while using roughly 18% of the MSMARCO data. Out-of-domain on BRIGHT, a 14B Rank-R1 model substantially surpasses zero-shot and non-reasoning baselines, and in some cases outperforms GPT-4-based Listwise rerankers. The explicit reasoning traces improve explainability, and the results indicate that reasoning-enabled rerankers generalize better to complex, cross-domain queries, with practical implications for search engine result presentation and user trust.
Abstract
In this paper, we introduce Rank-R1, a novel LLM-based reranker that performs reasoning over both the user query and candidate documents before performing the ranking task. Existing document reranking methods based on large language models (LLMs) typically rely on prompting or fine-tuning LLMs to order or label candidate documents according to their relevance to a query. For Rank-R1, we use a reinforcement learning algorithm along with only a small set of relevance labels (without any reasoning supervision) to enhance the reasoning ability of LLM-based rerankers. Our hypothesis is that adding reasoning capabilities to the rerankers can improve their relevance assessment and ranking capabilities. Our experiments on the TREC DL and BRIGHT datasets show that Rank-R1 is highly effective, especially for complex queries. In particular, we find that Rank-R1 achieves effectiveness on in-domain datasets on par with that of supervised fine-tuning methods, while using only 18% of the training data used by the fine-tuning methods. We also find that the model substantially outperforms zero-shot and supervised fine-tuning approaches when applied to out-of-domain datasets featuring complex queries, especially when a 14B-size model is used. Finally, we qualitatively observe that Rank-R1's reasoning process improves the explainability of the ranking results, opening new opportunities for how search engine results are presented and consumed.
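As an illustration of the training signal summarized above (GRPO with a rule-based reward and no reasoning supervision), the following is a minimal sketch and not the authors' implementation. It makes several assumptions: that the model emits its reasoning and final choice in `<think>...</think><answer>[k]</answer>` form, that a malformed output earns zero reward, and that advantages are obtained by normalizing rewards within a group of sampled completions using the population standard deviation. The names `rule_based_reward` and `group_relative_advantages` are hypothetical.

```python
import re
import statistics

# Assumed output format: free-form reasoning, then a bracketed document label.
_OUTPUT = re.compile(
    r"^<think>.+?</think>\s*<answer>\s*(\[\d+\])\s*</answer>$", re.DOTALL
)


def rule_based_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 only if the output is well formed and picks the gold label.

    The reasoning text itself is never scored, so no reasoning
    annotations are needed (assumption: binary 0/1 reward).
    """
    match = _OUTPUT.match(completion.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    return 1.0 if match.group(1) == gold_label else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical completions sampled for one query whose gold label is [2].
samples = [
    "<think>Doc 2 answers the question directly.</think><answer>[2]</answer>",
    "<think>Doc 1 mentions the topic.</think><answer>[1]</answer>",
    "[2]",  # correct label, but the required format is missing
    "<think>Doc 3 looks related.</think> <answer>[3]</answer>",
]
rewards = [rule_based_reward(s, "[2]") for s in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch only the first completion is rewarded, so it receives a positive advantage and the other three receive negative ones. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.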
