Generative Reasoning Re-ranker

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Hamed Firooz, Wenlin Chen, Luke Simon

TL;DR

This paper tackles the reranking gap in LLM-based recommender systems by introducing the Generative Reasoning Re-ranker (GR2), a three-stage pipeline that (1) mid-trains a pretrained LLM on semantic item IDs (SIDs) to bridge item semantics with world knowledge, (2) uses a stronger LLM to generate high-quality reasoning traces via carefully designed prompts and rejection sampling, which then serve as supervised fine-tuning data, and (3) applies Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) with verifiable rewards tailored for reranking. Across two real-world datasets, GR2 surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5, with ablations showing substantial benefits from advanced reasoning traces and the necessity of RL to translate reasoning into ranking improvements. The work also highlights the risk of reward hacking in RL, where the model simply preserves the item order, and proposes conditional verifiable rewards to mitigate it, improving the robustness and effectiveness of the reranking process. Overall, GR2 advances interpretable, scalable, reasoning-aware reranking for large-scale recommender systems by tightly integrating semantic representations, structured reasoning, and reward-driven supervision.

Abstract

Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Re-ranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger, larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to hack the reward by preserving the original item order, motivating conditional verifiable rewards that mitigate this behavior and optimize reranking performance.
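
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how Recall@5 and NDCG@5 are commonly computed for a reranked list. It assumes binary relevance (an item is either a ground-truth target or not), which is a typical evaluation setup but is not confirmed by the abstract; the function names `recall_at_k` and `ndcg_at_k` and the toy item IDs are illustrative and not taken from the paper.

```python
import math


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant items that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance NDCG@k: DCG of the top-k divided by the ideal DCG."""
    # Positions are 0-based here, so the discount for position p is log2(p + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (hypothetical item IDs): the target item "c" sits at position 3
# in the candidate order, and a reranker moves it to position 1.
candidates = ["a", "b", "c", "d", "e", "f"]
reranked = ["c", "a", "b", "d", "e", "f"]
target = {"c"}

before = ndcg_at_k(candidates, target)  # 1 / log2(4)
after = ndcg_at_k(reranked, target)     # 1 / log2(2)
```

Note that in this example Recall@5 is identical before and after reranking, because the target was already inside the top 5; only the position-sensitive NDCG@5 rewards moving it up. Under this assumed setup, a model that merely preserves the candidate order leaves both metrics unchanged, which is the behavior the abstract describes as reward hacking.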

Paper Structure

This paper contains 50 sections, 27 equations, 1 figure, 6 tables, 1 algorithm.

Figures (1)

  • Figure 1: Overview of the 3-stage training pipeline: student LLM mid-training on tokenized semantic IDs (top), reasoning data generation with a teacher LLM and rejection sampling (middle), and enabling reasoning in the student LLM via SFT and RL (bottom).