
Co-Evolution of Policy and Internal Reward for Language Agents

Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, Fanqi Kong, Tung Sum Thomas Kwok, Xiao-Wen Chang, Yuyu Luo, Chenglin Wu, Bang Liu

Abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: a better policy produces better guidance, and better guidance, used as internal reward, further improves the policy. Across three agent benchmarks, inference-time self-guidance alone already yields clear gains, while jointly evolving the policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward while acting and learning.
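
To make the loop concrete, the following is a minimal Python sketch, under our own assumptions rather than the authors' released code, of how self-guidance might be interleaved with actions during rollout and then recycled as an internal reward blended with the environment reward for GRPO-style optimization. All helper names (generate_guidance, generate_action, score_guidance, env.step) and the aggregation and scheduling choices are hypothetical placeholders.

    from typing import List, Tuple

    def rollout(env, policy, max_steps: int = 30) -> Tuple[List[dict], float]:
        # Interleave a short self-guidance signal z_t with each action a_t,
        # mirroring the loop described in the abstract and Figure 2.
        obs = env.reset()
        trajectory: List[dict] = []
        for _ in range(max_steps):
            z_t = policy.generate_guidance(obs, trajectory)     # verbal self-assessment of the trajectory so far
            a_t = policy.generate_action(obs, trajectory, z_t)  # next action conditioned on the guidance
            obs, reward, done = env.step(a_t)                   # environment reward is sparse and delayed
            trajectory.append({"guidance": z_t, "action": a_t, "observation": obs})
            if done:
                return trajectory, reward                       # R_env(tau), e.g. terminal task success
        return trajectory, 0.0

    def internal_reward(trajectory: List[dict], score_guidance) -> float:
        # Map each self-guidance signal to a scalar step reward and aggregate
        # (mean here; the paper's exact aggregation may differ).
        scores = [score_guidance(step["guidance"]) for step in trajectory]
        return sum(scores) / max(len(scores), 1)

    def grpo_advantages(r_env: List[float], r_sg: List[float], lam: float) -> List[float]:
        # Blend environment and internal rewards with a stage-dependent
        # coefficient lambda(u), then normalize within the rollout group as in GRPO.
        total = [e + lam * s for e, s in zip(r_env, r_sg)]
        mean = sum(total) / len(total)
        std = (sum((r - mean) ** 2 for r in total) / len(total)) ** 0.5
        return [(r - mean) / (std + 1e-8) for r in total]

In this sketch, lam stands in for the stage-dependent coefficient λ(u); how it changes over training is governed by the stage-wise guidance-reward schedule ablated in Figure 5.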

Figures (13)

  • Figure 1: Sparse environment reward alone may fail to distinguish trajectories with different intermediate quality, leaving GRPO with weakly separated training signals. Our method interleaves self-guided steps with action steps during trajectory generation, so that the same signal can guide action selection at inference time and be aggregated as internal reward at training time. This yields denser and more discriminative trajectory-level supervision for policy optimization.
  • Figure 2: Comparison between baseline GRPO and GRPO with Self-Guide. Baseline GRPO optimizes a policy using sparse trajectory-level environment rewards. Our method augments each step with a verbal self-guidance signal $z_t$: the model first generates $z_t$ to assess the current trajectory, and then produces action $a_t$ conditioned on $z_t$. The same self-guidance signals are mapped to step-level internal rewards, aggregated into $R_{\mathrm{sg}}(\tau)$, and combined with $R_{\mathrm{env}}(\tau)$ via a stage-dependent coefficient $\lambda(u)$ for joint policy optimization.
  • Figure 3: Self-guidance without training already improves performance in structured environments (ALFWorld) but yields inconsistent gains in more complex ones (WebShop), indicating that self-guidance quality depends on task familiarity.
  • Figure 4: Training curves on the three environments with Qwen3-1.7B as base model.
  • Figure 5: Ablation on stage-wise guidance-reward scheduling. Left: Training curves under different reward schedules. Right: Final validation success rates.
  • ...and 8 more figures