Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation

Jie Jiang, Yangru Huang, Zeyu Wang, Changping Wang, Yuling Xiong, Jun Zhang, Huan Yu

TL;DR

V-STAR tackles probability–reward misalignment in SID-based generative recommendation by coupling Value-Guided Efficient Decoding (VED) with tree-structured reinforcement learning via Sibling-GRPO. VED performs budgeted, value-aware expansion to surface high-potential prefixes, while Sibling-GRPO concentrates learning signals on decisive branching decisions within prefix groups. Experiments on offline and online data show consistent gains in accuracy, diversity, and commercial impact under strict token budgets, with notable improvements in HR, NDCG, and GMV. The approach offers a practical path to more effective long-tail discovery and stable RL-based alignment for generative recommenders in production settings.

Abstract

Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
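
To make the idea of sibling-relative advantages concrete, here is a minimal sketch, not the paper's formulation. It assumes each sampled trajectory is a tuple of tokens, that "siblings" are trajectories sharing the same parent prefix of a chosen length `depth`, and that advantages are obtained by GRPO-style standardization of rewards within each sibling group. The function name `sibling_relative_advantages` and the parameters `depth` and `eps` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def sibling_relative_advantages(trajectories, rewards, depth, eps=1e-6):
    """Standardize rewards within groups of trajectories sharing a parent prefix.

    Assumption (not from the paper): siblings are trajectories whose first
    `depth` tokens are identical, and the advantage is (r - mean) / (std + eps)
    computed inside each such group.
    """
    # Group trajectory indices by their shared parent prefix.
    groups = defaultdict(list)
    for idx, traj in enumerate(trajectories):
        groups[tuple(traj[:depth])].append(idx)

    advantages = [0.0] * len(trajectories)
    for members in groups.values():
        group_rewards = [rewards[i] for i in members]
        mu = mean(group_rewards)
        sigma = pstdev(group_rewards)  # population std; 0.0 for a singleton group
        for i in members:
            advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


# Toy example: two prefix groups, (1, 2) and (1, 5).
example_trajectories = [(1, 2, 3), (1, 2, 4), (1, 5, 6), (1, 5, 7)]
example_rewards = [1.0, 0.0, 0.2, 0.2]
example_advantages = sibling_relative_advantages(
    example_trajectories, example_rewards, depth=2
)
```

In this toy example, the two trajectories under prefix `(1, 2)` receive advantages of opposite sign, while the two under `(1, 5)` have identical rewards and therefore receive zero advantage. The second group illustrates the advantage-compression problem the abstract describes: a group with no reward variance contributes no comparative signal.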

Paper Structure (20 sections, 19 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Comparison of Beam Search and V-STAR. Left: probability-based pruning removes high-reward items and produces homogeneous candidates. Right: V-STAR expands high-value prefixes under a fixed budget and strengthens within-group learning with Sibling-GRPO.
  • Figure 2: Overview of the proposed self-evolving decoding-and-learning framework V-STAR.
  • Figure 3: Spearman Correlation $\rho$ of Probability (P) and Value (V) Signals with Ground-truth Rewards.
  • Figure 4: Training time scaling with decoding token budget.