Agentic Policy Optimization via Instruction-Policy Co-Evolution

Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

TL;DR

This work tackles the sensitivity of agent performance to static instructions in reinforcement learning with verifiable rewards (RLVR) by introducing INSPO, a framework that co-evolves instructions and policy in an online loop. INSPO maintains a dynamic population of instruction candidates and uses an experience-driven reflection mechanism to generate new instructions, coupled with a pruning-and-verification cycle to keep training stable. On multi-turn, tool-using QA benchmarks, INSPO shows substantial performance gains over static-instruction baselines and other tool-based methods, with only marginal additional computation. The results suggest that online, reflective instruction optimization can guide large language model agents toward more strategic reasoning paths, reducing manual prompt engineering while improving robustness and versatility. Overall, INSPO advances agent-centric RL by making instructions an adaptive, learnable component of the training process.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.

Paper Structure

This paper contains 15 sections, 6 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of INSPO. In phase 1, INSPO maintains a dynamic population of instruction candidates. For each sampled question, an instruction is sampled with a selection probability weighted by its importance. The reward signals update both the policy model and the importance of each instruction. In addition, INSPO keeps a replay buffer that prioritizes failed or low-reward trajectories (marked in red) for experience-driven self-reflection. In phase 2, the experience history provides a correction signal to an LLM-based instruction-proposer module, which analyzes the failure cases and evolves new instructions via self-reflection. New instructions are then passed to verification, and the top-performing candidates are merged into the active instruction population.
  • Figure 2: (a) INSPO vs. Search-R1: INSPO reaches a higher reward at convergence than the Search-R1 baseline. (b) Number of tool calls: INSPO discovers instructions that lead agents to make more tool calls when solving a problem, whereas the baseline converges to single-turn tool use. (c) Prompt length: INSPO periodically evolves longer and more effective instructions over the course of RL training, whereas the baseline keeps a static instruction. (d) Response length: Because INSPO makes more tool calls, its converged responses contain more tokens, carrying richer information from the search engine.
  • Figure 3: A demonstration of how instructions co-evolve with the policy model when using the search tool. The policy model is first prompted with an instruction-question pair and generates trajectories with environmental feedback and rewards, which are passed to an LLM-based optimizer for experience-driven reflection. The optimizer critiques the failures and proposes new instruction candidates, forming an online optimization loop for the instruction.
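
The phase 1 bookkeeping described in the Figure 1 caption (importance-weighted instruction sampling, reward attribution, pruning, and a replay buffer that favors low-reward trajectories) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the class names, the running-mean importance score, the softmax with a `temperature` parameter, the neutral prior of 0.5 for unseen instructions, and the "keep the lowest-reward items" buffer policy are all assumptions introduced here.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Instruction:
    """One instruction candidate with its attributed reward statistics."""
    text: str
    reward_sum: float = 0.0
    count: int = 0

    @property
    def importance(self) -> float:
        # Assumption: importance is the mean attributed reward, with a
        # neutral prior of 0.5 for instructions that have not been tried.
        return self.reward_sum / self.count if self.count else 0.5


@dataclass
class InstructionPopulation:
    """Hypothetical container for the active instruction candidates."""
    candidates: list
    temperature: float = 0.5  # assumed softmax temperature

    def selection_probs(self) -> list:
        # Assumption: a softmax over importance scores gives the
        # selection probabilities (computed in a numerically stable way).
        logits = [c.importance / self.temperature for c in self.candidates]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, rng: random.Random) -> Instruction:
        # Draw one instruction for the current question.
        return rng.choices(self.candidates, weights=self.selection_probs(), k=1)[0]

    def attribute(self, instruction: Instruction, reward: float) -> None:
        # Credit the rollout reward to the instruction that produced it.
        instruction.reward_sum += reward
        instruction.count += 1

    def prune(self, keep: int) -> None:
        # Keep only the `keep` highest-importance candidates.
        self.candidates.sort(key=lambda c: c.importance, reverse=True)
        del self.candidates[keep:]


class FailureReplayBuffer:
    """Assumed policy: retain the `capacity` lowest-reward trajectories."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = []  # list of (reward, trajectory) pairs

    def add(self, reward: float, trajectory: str) -> None:
        self.items.append((reward, trajectory))
        self.items.sort(key=lambda pair: pair[0])  # lowest reward first
        del self.items[self.capacity:]

    def worst(self, k: int) -> list:
        # Trajectories to hand to the reflection step, worst first.
        return [trajectory for _, trajectory in self.items[:k]]
```

In a training loop, one would call `sample` for each question, run the rollout, pass the verifiable reward to `attribute` and to the buffer's `add`, and call `prune` periodically. The phase 2 reflection and verification steps involve an LLM-based optimizer and are deliberately left out of this sketch.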