OneRec-V2 Technical Report

Guorui Zhou, Hengrui Hu, Hongtao Cheng, Huanjie Wang, Jiaxin Deng, Jinghao Zhang, Kuo Cai, Lejian Ren, Lu Ren, Liao Yu, Pengfei Zheng, Qiang Luo, Qianqian Wang, Qigen Hu, Rui Huang, Ruiming Tang, Shiyao Wang, Shujie Yang, Tao Wu, Wuchao Li, Xinchen Luo, Xingmei Wang, Yi Su, Yunfan Wu, Zexuan Cheng, Zhanyu Liu, Zixing Zhang, Bin Zhang, Boxuan Wang, Chaoyi Ma, Chengru Song, Chenhui Wang, Chenglong Chu, Di Wang, Dongxue Meng, Dunju Zang, Fan Yang, Fangyu Zhang, Feng Jiang, Fuxing Zhang, Gang Wang, Guowang Zhang, Han Li, Honghui Bao, Hongyang Cao, Jiaming Huang, Jiapeng Chen, Jiaqiang Liu, Jinghui Jia, Kun Gai, Lantao Hu, Liang Zeng, Qiang Wang, Qidong Zhou, Rongzhou Zhang, Shengzhe Wang, Shihui He, Shuang Yang, Siyang Mao, Sui Huang, Tiantian He, Tingting Gao, Wei Yuan, Xiao Liang, Xiaoxiao Xu, Xugang Liu, Yan Wang, Yang Zhou, Yi Wang, Yiwu Liu, Yue Song, Yufei Zhang, Yunfeng Zhao, Zhixin Ling, Ziming Li

TL;DR

This work reengineers end-to-end generative recommender systems by replacing the encoder-heavy architecture with a lazy decoder-only design, which cuts total computation by 94% and enables scaling to 8B parameters. It couples post-training preference alignment with real-world user feedback signals through duration-aware reward shaping and Gradient-Bounded Policy Optimization to mitigate reward hacking and improve sample efficiency. Online A/B tests on the Kuaishou platform show App Stay Time gains of 0.467%/0.741% while balancing multi-objective recommendations, supporting both the architectural efficiency and the alignment with user behavior.

Abstract

Recent breakthroughs in generative AI have transformed recommender systems through end-to-end generation. OneRec reformulates recommendation as an autoregressive generation task, achieving high Model FLOPs Utilization. While OneRec-V1 has shown significant empirical success in real-world deployment, two critical challenges hinder its scalability and performance: (1) inefficient computational allocation where 97.66% of resources are consumed by sequence encoding rather than generation, and (2) limitations in reinforcement learning relying solely on reward models. To address these challenges, we propose OneRec-V2, featuring: (1) Lazy Decoder-Only Architecture: Eliminates encoder bottlenecks, reducing total computation by 94% and training resources by 90%, enabling successful scaling to 8B parameters. (2) Preference Alignment with Real-World User Interactions: Incorporates Duration-Aware Reward Shaping and Adaptive Ratio Clipping to better align with user preferences using real-world feedback. Extensive A/B tests on Kuaishou demonstrate OneRec-V2's effectiveness, improving App Stay Time by 0.467%/0.741% while balancing multi-objective recommendations. This work advances generative recommendation scalability and alignment with real-world feedback, representing a step forward in the development of end-to-end recommender systems.

Paper Structure

This paper contains 43 sections, 25 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Left: Scaling curves for various model architectures from 0.1B to 8B parameters, among which Lazy Decoder-only models demonstrate the best scaling efficiency. Right: OneRec-V1 vs. OneRec-V2 at 1B parameters.
  • Figure 2: The overall architecture and post-training framework of OneRec-V2. The left panel illustrates the Lazy Decoder-Only Architecture; the right panel depicts the post-training preference alignment process.
  • Figure 3: Naive Impression Organization: The pattern A$\rightarrow$B is redundantly trained across multiple impressions. User-Centric Organization: When training on User-2's data at time $t_3$, the model has already learned the pattern B$\rightarrow$C from User-1's future interactions at $t_4$. New Impression Only Organization: It trains only on the newest impression.
  • Figure 4: Architecture of the proposed lazy decoder-only generative recommender. The Context Processor transforms heterogeneous user feature pathways into unified context representations, which are then normalized to produce layer-shared key-value pairs for cross-attention. The Lazy Decoder processes the BOS token and the tokenized semantic IDs of the target item through stacked transformer blocks. Each block comprises: (1) lazy cross-attention without key-value projections, enabling Grouped Query Attention (GQA); (2) causal self-attention; and (3) a feed-forward network. The final representations are projected to predict semantic IDs for next-item recommendation.
  • Figure 5: Training curves for different architectures across three model scales. Despite achieving similar loss, the Lazy Decoder-Only architecture requires 10× fewer FLOPs than classic architectures. E1D1 and E1D2 denote encoder-decoder parameter ratios of 1:1 and 1:2, respectively.
  • ...and 7 more figures
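
The lazy cross-attention step named in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, the RMS-style normalization, the way the context is split into keys and values, and the toy sizes are all assumptions made for illustration. It shows only the two properties the caption states: the decoder layers apply no key/value projections of their own, reusing the same precomputed context keys and values in every layer, and several query heads share one key/value head (GQA).

```python
import numpy as np


def rms_norm(x, eps=1e-6):
    # Assumed normalization; the caption only says the context is "normalized".
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def lazy_cross_attention(h, ctx_k, ctx_v, w_q, w_o, n_q_heads):
    """Cross-attention with no per-layer key/value projections (hypothetical helper).

    h:      (T, d_model) decoder hidden states
    ctx_k:  (S, n_kv_heads, d_head) precomputed context keys
    ctx_v:  (S, n_kv_heads, d_head) precomputed context values
    w_q:    (d_model, n_q_heads * d_head) query projection
    w_o:    (n_q_heads * d_head, d_model) output projection
    """
    T = h.shape[0]
    _, n_kv_heads, d_head = ctx_k.shape
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_q_heads // n_kv_heads

    # Only the queries are projected inside the layer.
    q = (h @ w_q).reshape(T, n_q_heads, d_head)

    # GQA: each group of consecutive query heads reads the same key/value head.
    k = np.repeat(ctx_k, group, axis=1)  # (S, n_q_heads, d_head)
    v = np.repeat(ctx_v, group, axis=1)

    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d_head)
    weights = softmax(scores)            # each row sums to 1 over context positions
    out = np.einsum("hts,shd->thd", weights, v)
    return out.reshape(T, n_q_heads * d_head) @ w_o


# Toy sizes, chosen arbitrarily for the demo.
rng = np.random.default_rng(0)
T, S, d_model, n_q_heads, n_kv_heads, d_head = 4, 6, 16, 4, 2, 4

# Stand-in for the Context Processor output: normalized once, then split into K and V.
context = rng.normal(size=(S, 2 * n_kv_heads * d_head))
k_flat, v_flat = np.split(rms_norm(context), 2, axis=-1)
ctx_k = k_flat.reshape(S, n_kv_heads, d_head)
ctx_v = v_flat.reshape(S, n_kv_heads, d_head)

h = rng.normal(size=(T, d_model))
for _ in range(2):  # two decoder layers reuse the same ctx_k / ctx_v
    w_q = rng.normal(size=(d_model, n_q_heads * d_head)) * 0.1
    w_o = rng.normal(size=(n_q_heads * d_head, d_model)) * 0.1
    h = h + lazy_cross_attention(h, ctx_k, ctx_v, w_q, w_o, n_q_heads)
```

Because the keys and values are computed once outside the loop, adding decoder layers in this sketch adds only query and output projections to the cross-attention cost. The causal self-attention and feed-forward sublayers from the caption are omitted for brevity.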