R1-Ranker: Teaching LLM Rankers to Reason

Tao Feng, Zhigang Hua, Zijie Lei, Yan Xie, Shuang Yang, Bo Long, Jiaxuan You

TL;DR

This paper unifies diverse ranking tasks under a single LLM-based ranker by introducing R1-Ranker, a reasoning-driven reinforcement learning framework. It presents two designs: DRanker, which produces the full ranking in one shot, and IRanker, which iteratively excludes candidates, shrinking the output space at each step to allow deeper reasoning. Across nine datasets spanning recommendation, routing, and passage ranking, IRanker-3B achieves state-of-the-art performance among general baselines and is competitive with domain-specific methods, with a 15.7% average relative improvement. Zero-shot experiments and reasoning traces further demonstrate transferability to out-of-domain tasks and to other LLMs. The work suggests that a unified, reasoning-focused foundation model can robustly handle diverse ranking problems and lay the groundwork for efficient, scalable LLM-based ranking systems.

Abstract

Large language models (LLMs) have recently shown strong reasoning abilities in domains like mathematics, coding, and scientific problem-solving, yet their potential for ranking tasks, where prime examples include retrieval, recommender systems, and LLM routing, remains underexplored. Ranking requires complex reasoning across heterogeneous candidates, but existing LLM-based rankers are often domain-specific, tied to fixed backbones, and lack iterative refinement, limiting their ability to fully exploit LLMs' reasoning potential. To address these challenges, we propose R1-Ranker, a reasoning-incentive framework built on reinforcement learning, with two complementary designs: DRanker, which generates full rankings in one shot, and IRanker, which decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified R1-Rankers on nine datasets spanning recommendation, routing, and passage ranking, showing that IRanker-3B consistently achieves state-of-the-art performance, surpasses larger 7B models on some tasks, and yields a 15.7% average relative improvement. Ablation and generalization experiments further confirm the critical role of reinforcement learning and iterative reasoning, with IRanker-3B improving zero-shot performance by over 9% on out-of-domain tasks and reasoning traces boosting other LLMs by up to 22.87%. These results demonstrate that unifying diverse ranking tasks with a single reasoning-driven foundation model is both effective and essential for advancing LLM reasoning in ranking scenarios.
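The iterative elimination idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `pick_least_likely` is a hypothetical stand-in for the LLM reasoning step, and `exclusion_rewards` is an assumed toy reward (the exact reward used in the paper is not given in this summary).

```python
from typing import Callable, List, Sequence, Set, Tuple


def iterative_rank(
    candidates: Sequence[str],
    pick_least_likely: Callable[[List[str]], str],
) -> Tuple[List[str], List[str]]:
    """Rank candidates by repeatedly excluding the least likely one.

    `pick_least_likely` is a hypothetical placeholder for an LLM call that
    reasons over the current pool and names the item to drop.
    Returns (ranking, best first; exclusion order, first dropped first).
    """
    pool = list(candidates)
    excluded: List[str] = []
    while len(pool) > 1:
        worst = pick_least_likely(pool)  # one reasoning step on the current pool
        pool.remove(worst)               # the pool shrinks by one item per step
        excluded.append(worst)
    # The survivor ranks first; items dropped later rank above items dropped earlier.
    ranking = pool + excluded[::-1]
    return ranking, excluded


def exclusion_rewards(excluded: Sequence[str], relevant: Set[str]) -> List[float]:
    """Toy step-wise reward (an assumption, not the paper's definition):
    1.0 when the dropped item is not relevant, 0.0 when a relevant item is dropped."""
    return [0.0 if item in relevant else 1.0 for item in excluded]


# Toy usage: fixed scores stand in for the model's judgement.
scores = {"a": 0.2, "b": 0.9, "c": 0.5}
ranking, excluded = iterative_rank(
    list(scores), lambda pool: min(pool, key=scores.__getitem__)
)
rewards = exclusion_rewards(excluded, relevant={"b"})
```

In this sketch each decision receives its own reward, whereas a one-shot ranker would be scored only once on the final list.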

Paper Structure

This paper contains 18 sections, 7 equations, 4 figures, 37 tables.

Figures (4)

  • Figure 1: Example ranking tasks that the proposed R1-Ranker can solve. (a) The recommendation task models a user's preferences from their historical behavior, ranks the current candidate items, and predicts which items the user is most likely to prefer. (b) The routing task recommends suitable LLMs to respond to different user queries. It takes into account the effectiveness and cost of each LLM's response and ranks the LLMs to generate the final recommendation list. (c) Passage ranking selects passages from a candidate set for a given user query, for use in retrieval-augmented generation. It ranks the passages by modeling the relevance between the query and each passage to produce the final passage list.
  • Figure 2: Framework of our proposed R1-Ranker. Both DRanker and IRanker are RL-enhanced LLM frameworks that exploit the reasoning ability of LLMs to solve ranking tasks. They take as input the candidate information in text form, along with user information (such as user history or a query), and utilize LLM reasoning to produce a final candidate ranking. This ranking is then evaluated by an evaluator to generate a corresponding reward signal, which is used to optimize the decision-making of both rankers. The key distinctions are: 1) DRanker performs reasoning once to generate the full ranking in a single step, whereas IRanker conducts step-wise reasoning by iteratively excluding the least likely item from the candidate pool. 2) The reward in DRanker is a ranking reward based on the final candidate list, while the reward in IRanker is an exclusion reward provided for each individual decision, which encourages finer-grained reasoning. 3) DRanker always receives the full set of candidates as input with a fixed size, whereas IRanker’s input candidates are dynamically updated based on the excluded items, enabling adaptive reasoning throughout the ranking process.
  • Figure 3: IRanker-3B matches the performance of domain-specific methods across multiple tasks with fewer training samples and smaller model size. We compared the performance of IRanker-3B against three representative SOTA methods and Qwen2.5-3B-Instruct-iter across three scenarios. SOTA-1, SOTA-2, and SOTA-3 correspond to SASRec [kang2018self], BPR [rendle2012bpr], and R1-Rec [lin2025rec] in the recommendation (Rec) scenario; GraphRouter [feng2024graphrouter], RouterBert [ong2024routellm], and RouterKNN [hu2024routerbench] in the routing (Router) scenario; RankLLama-8B [ma2024fine], RankBERT [nogueira2019passage], and MonoT5 [nogueira2020document] in the passage ranking (Passage) scenario.
  • Figure 4: Thoughts that emerge from IRanker during training can enhance the zero-shot performance of the base model. IRanker-COT-3B is an iterative framework that, for each test query, retrieves similar queries and the corresponding thoughts that emerged during the training of IRanker, and uses them as thought templates to guide zero-shot responses. We evaluate IRanker-COT-3B on nine tasks and compare its performance with IRanker-3B and Qwen2.5-3B-Instruct-iter. The results show that IRanker-COT-3B consistently outperforms Qwen2.5-3B-Instruct-iter and even surpasses IRanker-3B on the Rec-Game task.
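
The thought-template retrieval described for Figure 4 might look roughly like the sketch below. Everything here is illustrative: the summary does not say how query similarity is computed, so token-level Jaccard overlap is used as an assumed stand-in, and `retrieve_thoughts`, `build_prompt`, and the prompt wording are hypothetical names and text, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (assumed stand-in for the paper's retriever)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve_thoughts(
    query: str, memory: Sequence[Tuple[str, str]], k: int = 2
) -> List[str]:
    """Return the thoughts attached to the k training queries most similar to `query`.

    `memory` holds (training_query, thought) pairs collected during training.
    """
    ranked = sorted(memory, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return [thought for _, thought in ranked[:k]]


def build_prompt(query: str, candidates: Sequence[str], thoughts: Sequence[str]) -> str:
    """Assemble a zero-shot prompt that shows retrieved thoughts as templates."""
    lines = ["Example reasoning from similar queries:"]
    lines += [f"- {t}" for t in thoughts]
    lines.append(f"Query: {query}")
    lines.append("Candidates: " + ", ".join(candidates))
    lines.append("Exclude the least likely candidate and explain why.")
    return "\n".join(lines)


# Toy usage with made-up memory entries.
memory = [
    ("cheap fast model for math", "T1"),
    ("best passage about cats", "T2"),
    ("fast model for code", "T3"),
]
thoughts = retrieve_thoughts("fast model for math questions", memory, k=2)
prompt = build_prompt("fast model for math questions", ["m1", "m2"], thoughts)
```

The resulting prompt would then be passed to the untrained base model at each exclusion step, so no weights are updated.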