RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang

TL;DR

This paper revisits exploration in Agentic RL and proposes Retrieval-Augmented Policy Optimization (RAPO), an RL framework that uses retrieval to explicitly expand exploration during training. Its Retrieval-aware Policy Optimization mechanism calibrates policy gradient estimation with a retrieval reward and importance shaping, which stabilizes training and prioritizes retrieval-illuminating exploration.

Abstract

Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential for large language model (LLM)-based agents, empowering them to tackle complex tasks via multi-step, tool-integrated reasoning. However, an inherent limitation of existing Agentic RL methods is their reliance on a purely on-policy paradigm for exploration, which restricts exploration to the agent's self-generated outputs and prevents the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically use full off-policy trajectories for trajectory-level policy estimation, overlooking the need for fine-grained, step-level exploratory dynamics within the agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, we propose a Hybrid-policy Agentic Rollout strategy, which allows the agent to continuously reason over retrieved off-policy step-level traces. It dynamically extends the agent's reasoning receptive field, enabling broader exploration conditioned on external behaviors. Subsequently, we introduce the Retrieval-aware Policy Optimization mechanism, which calibrates the policy gradient estimation with a retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while delivering 1.2x faster training.
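
To make the idea of a hybrid-policy rollout more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that a rollout is a list of text steps, that a retrieved off-policy step is simply appended to the context before the agent's next step, and that each step is tagged with its source so a later optimization stage could treat the two kinds differently. All names (`Step`, `hybrid_rollout`, `toy_policy`, `toy_retriever`, `retrieve_prob`) are hypothetical, and the random retrieval trigger is an assumption; the abstract does not specify when retrieval happens.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    text: str
    source: str  # "on_policy" (agent-generated) or "off_policy" (retrieved)


def hybrid_rollout(
    question: str,
    policy_step: Callable[[str], str],
    retrieve_step: Callable[[str], str],
    max_steps: int = 4,
    retrieve_prob: float = 0.5,  # assumed trigger; not taken from the paper
    rng: Optional[random.Random] = None,
) -> List[Step]:
    """Interleave retrieved off-policy steps with the agent's own steps."""
    rng = rng or random.Random(0)
    trajectory: List[Step] = []

    def context() -> str:
        # The agent always conditions on the question plus every step so far.
        return "\n".join([question] + [s.text for s in trajectory])

    for _ in range(max_steps):
        if rng.random() < retrieve_prob:
            # Inject an external step-level trace into the context.
            trajectory.append(Step(retrieve_step(context()), "off_policy"))
        # The agent continues reasoning, possibly conditioned on the trace.
        trajectory.append(Step(policy_step(context()), "on_policy"))
    return trajectory


def toy_policy(context: str) -> str:
    # Stand-in for an LLM call: behaves differently when a hint is visible.
    return "step using hint" if "HINT" in context else "step alone"


def toy_retriever(context: str) -> str:
    # Stand-in for a lookup over stored off-policy step traces.
    return "HINT: try decomposing the question"


if __name__ == "__main__":
    for step in hybrid_rollout("Q: ...", toy_policy, toy_retriever, max_steps=2):
        print(step.source, "|", step.text)
```

The `source` tag is the piece a retrieval-aware objective would rely on in this sketch, for example to separate retrieved tokens from agent-generated ones; how RAPO actually computes its retrieval reward and importance shaping is not described in the abstract and is not modeled here.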

Paper Structure

This paper contains 56 sections, 15 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison between existing methods and our framework. (a) Existing Agentic RL methods are inherently on-policy, resulting in a limited exploration space bounded by the native agent. (b) Off-policy-enhanced RL methods statically integrate full off-policy trajectories for trajectory-level policy estimation, failing to capture the dynamic, step-level exploration within agentic rollout. (c) Our RAPO introduces retrieval and allows the on-policy agent to continuously reason over the retrieved off-policy step-level traces, explicitly expanding its reasoning receptive field for exploration and thus increasing rollout diversity.
  • Figure 2: Overview of RAPO. RAPO introduces a Hybrid-policy Agentic Rollout strategy that supports off-policy-conditioned reasoning, enabling the agent to receive retrieved off-policy traces and broaden exploration beyond its intrinsic reasoning behaviors. It also incorporates a Retrieval-aware Policy Optimization mechanism with retrieval reward and importance shaping, ensuring effective and stable policy gradient estimation during training.
  • Figure 3: Efficiency study. RAPO shows clear training-efficiency gains in rollout time, policy update time, number of rollout tokens, and number of tool calls.
  • Figure 4: Comparison analysis between pure on-policy rollouts and hybrid-policy rollouts.
  • Figure 5: Robustness study for different off-policy models.
  • ...and 9 more figures