Agentic Policy Optimization via Instruction-Policy Co-Evolution
Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen
TL;DR
This work addresses the sensitivity of agent performance to static instructions in reinforcement learning with verifiable rewards (RLVR) by introducing INSPO, a framework that co-evolves instructions and policy in an online loop. INSPO maintains a dynamic population of instruction candidates, generates new instructions through an experience-driven reflection mechanism, and uses a pruning-and-verification cycle to keep training stable. On multi-turn, tool-using QA benchmarks, INSPO achieves substantial performance gains over static-instruction baselines and other tool-based methods, with only marginal additional computation. The results suggest that online, reflective instruction optimization can guide large language model agents toward more strategic reasoning paths, reducing manual prompt engineering while improving robustness and versatility. Overall, INSPO advances agent-centric RL by making instructions an adaptive, learnable component of the training process.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static, manually designed instructions. However, such instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores its environment. To bridge this gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled together with questions; reward signals from the RL loop are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, in which an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines that rely on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving these gains with only a marginal increase in computational overhead.
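
The population mechanics described in the abstract (sample an instruction with each question, attribute the reward to that instruction, prune low performers, add reflected candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `InstructionPopulation`, `co_evolution_step`, `rollout_reward`, `propose`, and `verify` are hypothetical, ranking by mean reward is an assumed pruning criterion, and the policy update and the LLM-based optimizer are replaced by caller-supplied functions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class InstructionStats:
    """Running reward statistics for one instruction candidate."""
    total_reward: float = 0.0
    count: int = 0

    @property
    def mean_reward(self) -> float:
        # Assumption of this sketch: a candidate with no rollouts scores 0.0.
        return self.total_reward / self.count if self.count else 0.0


class InstructionPopulation:
    """Hypothetical container for a dynamic set of instruction candidates."""

    def __init__(self, instructions: Iterable[str], seed: int = 0) -> None:
        self.stats: Dict[str, InstructionStats] = {
            text: InstructionStats() for text in instructions
        }
        self.rng = random.Random(seed)

    def sample(self) -> str:
        # Uniform sampling is an assumption; the source does not specify it.
        return self.rng.choice(list(self.stats))

    def record(self, instruction: str, reward: float) -> None:
        # Attribute the rollout reward to the instruction that produced it.
        entry = self.stats[instruction]
        entry.total_reward += reward
        entry.count += 1

    def prune(self, keep: int) -> List[str]:
        # Keep the `keep` candidates with the highest mean reward.
        ranked = sorted(
            self.stats, key=lambda t: self.stats[t].mean_reward, reverse=True
        )
        removed = ranked[keep:]
        for text in removed:
            del self.stats[text]
        return removed

    def add(self, instruction: str) -> None:
        # New candidates start with empty statistics.
        self.stats.setdefault(instruction, InstructionStats())


def co_evolution_step(
    population: InstructionPopulation,
    questions: Iterable[str],
    rollout_reward: Callable[[str, str], float],
    propose: Callable[[List[str]], str],
    keep: int,
    verify: Callable[[str], bool] = lambda candidate: True,
) -> None:
    """One illustrative round: sample, attribute rewards, prune, then evolve.

    `rollout_reward` stands in for running the agent and scoring its answer,
    `propose` stands in for the LLM-based reflection step, and the policy
    gradient update itself is omitted.
    """
    for question in questions:
        instruction = population.sample()
        population.record(instruction, rollout_reward(instruction, question))
    population.prune(keep)
    candidate = propose(list(population.stats))
    if verify(candidate):
        population.add(candidate)
```

A caller would invoke `co_evolution_step` once per pruning period, passing a `rollout_reward` that wraps the actual agent rollout and a `propose` function that queries an optimizer model with experience from a replay buffer.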
