Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization
Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu, Xia Zeng
TL;DR
This paper tackles two core bottlenecks in agentic RL for Tool-Integrated Reasoning (TIR): sparse, binary rewards and gradient degradation in Group Relative Policy Optimization (GRPO). It introduces Progressive Reward Shaping (PRS), a curriculum-inspired reward design that provides dense, stage-wise feedback. For short-form QA, PRS scores answers with a length-aware BLEU; for long-form QA, it uses an LLM-as-a-Judge to prevent reward hacking. The paper also presents Value-based Sampling Policy Optimization (VSPO), an enhanced GRPO variant that prioritizes informative samples via a task-value metric and stabilizes learning with value-smoothing clipping. Across short- and long-form QA benchmarks, PRS improves learning efficiency and reward guidance over binary rewards, while VSPO delivers more stable training, faster convergence, and higher final performance than PPO, GRPO, CISPO, and SFT-only baselines, yielding LLM-based TIR agents that generalize better across domains.
Abstract
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges limit its effectiveness: (1) sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reduces sample efficiency and destabilizes training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback, encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
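The gradient degradation described in challenge (2) can be illustrated with a small sketch. It assumes the commonly used group-relative advantage, in which each rollout's reward is normalized by the mean and standard deviation of its group; the function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean and std.

    Hypothetical helper: `eps` and the population std are assumptions
    made for this sketch, not the paper's exact formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: successful rollouts get positive advantages,
# failed rollouts get negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical outcomes (all fail or all succeed): every advantage is zero,
# so the whole group contributes no policy-gradient signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_fail)
print(all_pass)
```

With binary 0-1 rewards, groups for very hard or very easy prompts tend to fall into the all-fail or all-pass case, which is why denser rewards (PRS) and replacing low-value samples (VSPO) are both aimed at keeping within-group reward variance non-zero.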
