S-GRec: Personalized Semantic-Aware Generative Recommendation with Asymmetric Advantage

Jie Jiang, Hongbo Tang, Wenjie Wu, Yangru Huang, Zhenmao Li, Qian Li, Changping Wang, Jun Zhang, Huan Yu

TL;DR

S-GRec tackles the conflict between leveraging rich semantic priors from LLMs and maintaining strict business-objective alignment in industrial generative recommendation. It decouples semantic reasoning from online serving: an offline Personalized Semantic Judge (PSJ) produces interpretable aspect-level signals, and a user-conditional aggregator combines them into a holistic semantic reward. The semantic signals are fused with business rewards through Asymmetric Advantage Policy Optimization (A2PO), which ensures that semantic guidance only reinforces directions consistent with business goals and stabilizes training with group-based optimization (GRPO). Empirical results on public benchmarks and a large-scale online deployment show consistent gains in CTR and GMV without incurring real-time LLM inference costs, demonstrating practical viability for semantic-aware generative recommendation at scale.

Abstract

Generative recommendation formulates recommendation as sequence generation, producing items end-to-end, but training from behavioral logs often provides weak supervision on underlying user intent. Although Large Language Models (LLMs) offer rich semantic priors that could supply such supervision, direct adoption in industrial recommendation is hindered by two obstacles: semantic signals can conflict with platform business objectives, and LLM inference is prohibitively expensive at scale. This paper presents S-GRec, a semantic-aware framework that decouples an online lightweight generator from an offline LLM-based semantic judge for train-time supervision. S-GRec introduces a two-stage Personalized Semantic Judge (PSJ) that produces interpretable aspect evidence and learns user-conditional aggregation from pairwise feedback, yielding stable semantic rewards. To prevent semantic supervision from deviating from business goals, Asymmetric Advantage Policy Optimization (A2PO) anchors optimization on business rewards (e.g., eCPM) and injects semantic advantages only when they are consistent. Extensive experiments on public benchmarks and a large-scale production system validate both effectiveness and scalability, including statistically significant gains in CTR and a 1.19% lift in GMV in online A/B tests, without requiring real-time LLM inference.
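
The idea of anchoring on business rewards and injecting semantic advantages only when they are consistent can be illustrated with a minimal sketch. The abstract does not give the exact A2PO rule, so everything below is an assumption made for illustration: the function names `group_advantages` and `fuse_advantages`, the `weight` parameter, and the choice of sign agreement as the consistency test are all hypothetical. Advantages are standardized within a sampled group, in the style of GRPO.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fuse_advantages(business_rewards, semantic_rewards, weight=0.5):
    """Hypothetical asymmetric fusion (an assumption, not the paper's formula).

    The business advantage is always kept as the anchor. The semantic
    advantage is added only for candidates where it points in the same
    direction as the business advantage; otherwise it is dropped.
    """
    a_biz = group_advantages(business_rewards)
    a_sem = group_advantages(semantic_rewards)
    fused = []
    for b, s in zip(a_biz, a_sem):
        consistent = b * s > 0  # both positive or both negative
        fused.append(b + weight * s if consistent else b)
    return fused


if __name__ == "__main__":
    # One group of four candidates for the same user (made-up numbers).
    business = [1.0, 2.0, 3.0, 4.0]   # e.g., an eCPM-like signal
    semantic = [0.9, 0.1, 0.2, 0.8]   # e.g., an offline judge score
    for b, f in zip(group_advantages(business), fuse_advantages(business, semantic)):
        print(f"business-only {b:+.3f} -> fused {f:+.3f}")
```

In this toy group, the first and third candidates keep their business-only advantage because the semantic signal disagrees in sign, while the second and fourth are pushed further in the direction the business reward already indicates. The fusion is asymmetric in that the semantic term can amplify the business direction but never flip it.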

Paper Structure (50 sections, 9 equations, 8 figures, 8 tables)


Figures (8)

  • Figure 1: LLM applications in different ways. The dashed red line indicates LLM use only during training for our method, while the other two approaches require it in both training and inference.
  • Figure 2: Overview of S-GRec. The offline PSJ produces semantic rewards via two-stage scoring (aspect-level evidence $\rightarrow$ user-conditional aggregation). A2PO fuses semantic and business rewards in the advantage space with consistency gating, training a lightweight generator without serving-time LLM inference.
  • Figure 3: Asymmetric Advantage Fusion
  • Figure 4: HR@10 and NDCG@10 vs. semantic sampling ratio $p$ on Office and Industrial.
  • Figure 5: Relative HR lift (%) of S-GRec over MiniOneRec across novelty levels on the Office test set. $n$ denotes the number of test samples per bucket.
  • ...and 3 more figures