Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning

Jiacheng Lin, Tian Wang, Kun Qian

TL;DR

This work introduces Rec-R1, a reinforcement learning framework that directly optimizes an LLM's generation using feedback from a fixed recommendation system, avoiding costly data distillation. By treating LLM-RecSys interactions as a closed-loop RL problem and employing GRPO with rule-based downstream rewards, Rec-R1 achieves substantial gains across product search, sequential recommendation, and product re-ranking while preserving the LLM's general capabilities. Theoretical results explain why prompting and SFT are limited in this setting, and empirical results demonstrate strong cross-domain generalization and improved cold-start performance. Overall, Rec-R1 offers a scalable, cost-efficient pathway to continual, task-specific adaptation of LLMs in real-world recommender systems without catastrophic forgetting.

Abstract

We propose Rec-R1, a general reinforcement learning framework that bridges large language models (LLMs) with recommendation systems through closed-loop optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1 directly optimizes LLM generation using feedback from a fixed black-box recommendation model, without relying on synthetic SFT data from proprietary models such as GPT-4o. This avoids the substantial cost and effort required for data distillation. To verify the effectiveness of Rec-R1, we evaluate it on two representative tasks: product search and sequential recommendation. Experimental results demonstrate that Rec-R1 not only consistently outperforms prompting- and SFT-based methods, but also achieves significant gains over strong discriminative baselines, even when used with simple retrievers such as BM25. Moreover, Rec-R1 preserves the general-purpose capabilities of the LLM, unlike SFT, which often impairs instruction-following and reasoning. These findings suggest Rec-R1 as a promising foundation for continual task-specific adaptation without catastrophic forgetting.
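
The closed-loop signal described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes binary-relevance NDCG@k as the rule-based reward and the standardize-within-group advantage commonly used with GRPO. The function names, item IDs, and toy rankings are all hypothetical.

```python
import math
from statistics import mean, pstdev


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: 1.0 for an ideal ranking, 0.0 when nothing relevant is retrieved."""
    gains = [1.0 if item in relevant_ids else 0.0 for item in ranked_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (assumed GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: three LLM generations for the same user query, each sent
# to a fixed retriever that returns a ranked list of item IDs.
relevant = {"a", "b"}
rankings = [["a", "b", "x"], ["x", "a", "b"], ["x", "y", "z"]]

rewards = [ndcg_at_k(r, relevant, k=3) for r in rankings]
advantages = group_relative_advantages(rewards)

for reward, adv in zip(rewards, advantages):
    print(f"reward={reward:.3f} advantage={adv:+.3f}")
```

Generations whose retrieved lists score above the group average receive positive advantages and are reinforced; the recommender itself is only queried, never updated.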

Paper Structure

This paper contains 62 sections, 3 theorems, 31 equations, 6 figures, 20 tables.

Key Result

Lemma 1

Let $\pi_g(a|s)$ be a fixed target policy (e.g., the data-generating policy), and let $\pi_\theta(a|s)$ be a parameterized policy class. Consider the maximum likelihood estimation (MLE) objective of fitting $\pi_\theta$ to actions drawn from $\pi_g$. Maximizing this objective with respect to $\theta$ is equivalent to minimizing the expected Kullback-Leibler (KL) divergence between the target policy $\pi_g$ and the parameterized policy $\pi_\theta$.
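
The standard argument behind this equivalence can be sketched as follows. The state distribution $d(s)$ is an assumed symbol that does not appear in the statement above, and the objective is written in its usual textbook form rather than copied from the paper:

$$
\begin{aligned}
\mathbb{E}_{s \sim d}\Big[ D_{\mathrm{KL}}\big(\pi_g(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \Big]
&= \mathbb{E}_{s \sim d,\; a \sim \pi_g(\cdot|s)}\big[ \log \pi_g(a|s) - \log \pi_\theta(a|s) \big] \\
&= \underbrace{\mathbb{E}_{s \sim d,\; a \sim \pi_g(\cdot|s)}\big[ \log \pi_g(a|s) \big]}_{\text{independent of } \theta}
 \;-\; \mathbb{E}_{s \sim d,\; a \sim \pi_g(\cdot|s)}\big[ \log \pi_\theta(a|s) \big].
\end{aligned}
$$

Because the first term does not depend on $\theta$, the $\theta$ that maximizes the expected log-likelihood $\mathbb{E}_{s \sim d,\, a \sim \pi_g(\cdot|s)}[\log \pi_\theta(a|s)]$ is the same $\theta$ that minimizes the expected KL divergence.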

Figures (6)

  • Figure 1: Illustration of how generative LLMs are applied in recommender systems (LLM4Rec), following the taxonomy in lin2025can. The upper row shows the use of LLMs for feature engineering, including (1) Query Rewriting, where the LLM reformulates the input query to improve retrieval, and (2) User/Item-level Feature Augmentation, where the LLM encodes user or item information into richer textual representations as input to a downstream model. The lower row demonstrates the use of LLMs as Scoring/Ranking Functions, including (3) Closed-Set Item Generation, where the LLM ranks a given candidate list, and (4) Open-Set Item Generation, where the LLM directly generates candidate items and matches them to a product pool. Note that this figure primarily reflects the inference-time setting, so all LLMs are frozen. Our proposed Rec-R1 is compatible with all paradigms shown here (see Appendix \ref{app:paradigms}).
  • Figure 2: Proof-of-concept comparison under a small-scale setup, illustrating the limitations of SFT based on GPT-4o-generated data. (a) Performance on the ESCI dataset. The SFT baseline fine-tunes using data generated by GPT-4o, and its performance is inherently upper-bounded by the performance of GPT-4o itself. (b) Comparison of training time and cost. The total time for SFT includes both the data generation phase using GPT-4o and subsequent model fine-tuning. Rec-R1 requires no additional data generation, and we report the minimal training time and cost required to match the performance of the SFT and GPT-4o models. See Appendix \ref{app:pof_estimate} for cost estimation details.
  • Figure 3: Comparison of three paradigms for using LLMs in recommendation systems. (a) Prompting uses a frozen LLM to generate textual inputs for the recommendation system, without any model updates. (b) SFT trains the LLM to imitate outputs generated by a stronger model (e.g., GPT-4o), but the training process does not involve any RecSys feedback. (c) Rec-R1 introduces a closed-loop RL framework, where the LLM is optimized directly using reward signals from the recommendation system, without requiring external annotation or data distillation. Unlike SFT, which relies on labeled intermediate outputs (e.g., rewritten queries) from closed-source models, Rec-R1 operates directly on the same data and learns from the recommendation performance.
  • Figure 4: Generalization analysis across six benchmarks. We compare the initial model (Qwen-2.5-3B-Instruct), its SFT variant trained on GPT-4o-generated SFT data (ESCI), and our Rec-R1-3B model trained via RL. Note that Rec-R1 is trained only on the task-specific ESCI data, whose format differs drastically from that of the other benchmark datasets.
  • Figure 5: Qualitative comparison of retrieval results on the ESCI Video Games domain. We visualize the top-8 items retrieved by BM25 using different query formulations: the original user query (play station 3), a rewritten query by GPT-4o, and the output of our Rec-R1. Ground-truth relevant items are shown at the top. Rec-R1 significantly improves NDCG@100 by generating a highly detailed and semantically rich query, enabling precise matching with relevant items. Items correctly retrieved (i.e., appearing in the target set) are highlighted with red bounding boxes.
  • ...and 1 more figure
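
The effect described in the Figure 5 caption, where a richer query lets a plain BM25 retriever match the relevant items, can be illustrated with a toy example. This is a sketch under stated assumptions: the four-item catalog, the rewritten query, and the `BM25` class are all invented for illustration, and the scorer is a compact Okapi BM25 variant rather than the retriever configuration used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems use richer analyzers)."""
    return text.lower().split()


class BM25:
    """Compact Okapi BM25 scorer over a small in-memory catalog."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        n = len(docs)
        # Document frequency: in how many documents each term appears.
        df = Counter(term for tokens in self.doc_tokens for term in set(tokens))
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, idx):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[idx].get(term, 0)
            if f == 0:
                continue  # terms absent from the document contribute nothing
            norm = f + self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        """Return document indices sorted by descending BM25 score."""
        return sorted(range(len(self.doc_tokens)), key=lambda i: self.score(query, i), reverse=True)


# Hypothetical catalog; items 0 and 2 are the relevant ones.
catalog = [
    "playstation 3 console 160gb slim",
    "station wagon toy car play set",
    "sony ps3 dualshock wireless controller",
    "xbox 360 console 250gb",
]
relevant = {0, 2}
index = BM25(catalog)

original_query = "play station 3"
rewritten_query = "sony playstation 3 ps3 console controller"  # invented rewrite

original_ranking = index.rank(original_query)
rewritten_ranking = index.rank(rewritten_query)
print("original :", original_ranking)
print("rewritten:", rewritten_ranking)
```

In this toy catalog the original query matches the irrelevant toy listing on the literal tokens "play" and "station", while the rewritten query shares vocabulary with both relevant items. A ranking metric computed on such lists is what a downstream reward would measure.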

Theorems & Definitions (7)

  • Lemma 1: MLE Minimizes KL Divergence
  • proof
  • proof
  • Theorem 1: Performance Difference Upper Bound
  • proof
  • Theorem 2: Superiority of RL over SFT
  • proof