OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, Guorui Zhou

TL;DR

This work addresses the inefficiency of cascaded retrieve-and-rank pipelines by introducing OneRec, an end-to-end generative recommender that directly generates a session of items. It combines a transformer-based encoder-decoder with sparse Mixture-of-Experts, a session-wise generation objective, and Iterative Preference Alignment, which uses a reward model and Direct Preference Optimization to align outputs with user preferences. The approach is validated offline on large-scale data and online via deployment at Kuaishou, showing improvements in watch-time and engagement metrics. The paper also presents a practical training/deployment pipeline, including self-hard negative sampling and cost-efficient DPO, demonstrating the viability of industrial-scale end-to-end generative recommendation.

Abstract

Recently, generative retrieval-based recommendation systems have emerged as a promising paradigm. However, most modern recommender systems adopt a retrieve-and-rank strategy, where the generative model functions only as a selector during the retrieval stage. In this paper, we propose OneRec, which replaces the cascaded learning framework with a unified generative model. To the best of our knowledge, this is the first end-to-end generative model that significantly surpasses current complex and well-designed recommender systems in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder structure, which encodes the user's historical behavior sequences and gradually decodes the videos that the user may be interested in. We adopt sparse Mixture-of-Experts (MoE) to scale model capacity without proportionally increasing computational FLOPs. 2) a session-wise generation approach. In contrast to traditional next-item prediction, we propose session-wise generation, which is more elegant and contextually coherent than point-by-point generation that relies on hand-crafted rules to properly combine the generated results. 3) an Iterative Preference Alignment module combined with Direct Preference Optimization (DPO) to enhance the quality of the generated results. Unlike DPO in NLP, a recommendation system typically has only one opportunity to display results for each user's browsing request, making it impossible to obtain positive and negative samples simultaneously. To address this limitation, we design a reward model to simulate user generation and customize the sampling strategy. Extensive experiments have demonstrated that a limited number of DPO samples can align user interest preferences and significantly improve the quality of generated results. We deployed OneRec in the main scene of Kuaishou, achieving a 1.6% increase in watch-time, which is a substantial improvement.
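
The preference-alignment idea in the abstract can be illustrated with a minimal Python sketch. It assumes that several candidate sessions are generated for one request, that a reward model scores each of them, and that the highest- and lowest-scoring candidates form the (chosen, rejected) pair for the standard DPO loss. The function names, the `reward_fn` callable, and the default `beta` are hypothetical and are not taken from the paper; the loss is the generic DPO formulation, not necessarily the exact variant used in OneRec.

```python
import math


def select_preference_pair(candidates, reward_fn):
    """Pick (chosen, rejected) as the highest- and lowest-reward candidates.

    `candidates` is a list of generated sessions; `reward_fn` is a stand-in
    for a reward model that maps a session to a scalar score (assumption).
    """
    ranked = sorted(candidates, key=reward_fn)
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much the policy prefers the chosen session over
    the rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy usage with made-up reward scores and log-probabilities.
rewards = {"session_a": 0.2, "session_b": 0.9, "session_c": 0.1}
chosen, rejected = select_preference_pair(list(rewards), rewards.get)
loss = dpo_loss(logp_chosen=-4.0, logp_rejected=-6.0,
                ref_logp_chosen=-5.0, ref_logp_rejected=-5.0)
```

Because the pair is built from the model's own generations rather than from two displayed results, this sidesteps the constraint that each request is shown only once; the quality of the pair then depends entirely on the reward model.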

Paper Structure

This paper contains 23 sections, 11 equations, 6 figures, 2 tables, 2 algorithms.

Figures (6)

  • Figure 1: (a) Our proposed unified architecture for end-to-end generation. (b) A typical cascade ranking system, which includes three stages from the bottom to the top: Retrieval, Pre-ranking, and Ranking.
  • Figure 2: The overall framework of OneRec, which consists of two stages: (i) the session training stage, which trains OneRec with session-wise data; (ii) the IPA stage, which utilizes iterative direct preference optimization with self-hard negatives.
  • Figure 3: Framework of Online Deployment of OneRec.
  • Figure 4: The ablation study on DPO sample ratio $r_{\rm DPO}$. The results indicate that a 1% ratio of DPO training leads to significant gains, but further increasing the sample ratio yields limited improvements.
  • Figure 5: The visualization of the probability distribution of the softmax output for each layer of the semantic ID. The red star represents the semantic ID of the item with the highest reward value.
  • ...and 1 more figures
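
The DPO sample ratio $r_{\rm DPO}$ from the Figure 4 caption can be read as the fraction of training samples that receive the preference-alignment step. The following Python sketch shows one way such a ratio might gate that step; the function name, the per-sample random draw, and the seed are assumptions for illustration, not the paper's procedure.

```python
import random


def split_batch_by_dpo_ratio(batch, r_dpo, seed=0):
    """Route each sample to the DPO step with probability r_dpo.

    Returns (dpo_part, plain_part). Samples in dpo_part would additionally
    go through preference-pair construction and the DPO loss; the rest would
    use only the ordinary generation loss (hypothetical split).
    """
    rng = random.Random(seed)
    dpo_part, plain_part = [], []
    for sample in batch:
        if rng.random() < r_dpo:
            dpo_part.append(sample)
        else:
            plain_part.append(sample)
    return dpo_part, plain_part


# With r_dpo = 0.01, roughly 1% of samples take the costly alignment path.
dpo_part, plain_part = split_batch_by_dpo_ratio(list(range(10000)), r_dpo=0.01)
```

Keeping the ratio small limits the extra cost of generating and scoring candidates, which is consistent with the caption's observation that gains flatten beyond a 1% ratio.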