RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Siwei Zhang; Yun Xiong; Xi Chen; Zi'an Jia; Renhong Huang; Jiarong Xu; Jiawei Zhang

TL;DR

This paper revisits exploration in Agentic RL and proposes Retrieval-Augmented Policy Optimization (RAPO), an RL framework that uses retrieval to explicitly expand exploration during training. Its Retrieval-aware Policy Optimization mechanism calibrates policy gradient estimation with a retrieval reward and importance shaping, which stabilizes training and prioritizes retrieval-illuminating exploration.

Abstract

Agentic Reinforcement Learning (Agentic RL) has shown remarkable potential for large language model (LLM)-based agents, enabling them to tackle complex tasks via multi-step, tool-integrated reasoning. However, existing Agentic RL methods share an inherent limitation: they rely on a purely on-policy paradigm for exploration, which restricts exploration to the agent's self-generated outputs and prevents the discovery of new reasoning perspectives for further improvement. While recent efforts incorporate auxiliary off-policy signals to enhance exploration, they typically use full off-policy trajectories for trajectory-level policy estimation, overlooking the need for fine-grained, step-level exploratory dynamics within the agentic rollout. In this paper, we revisit exploration in Agentic RL and propose Retrieval-Augmented Policy Optimization (RAPO), a novel RL framework that introduces retrieval to explicitly expand exploration during training. To achieve this, we decompose the Agentic RL training process into two phases: (i) Hybrid-policy Agentic Rollout, and (ii) Retrieval-aware Policy Optimization. Specifically, the Hybrid-policy Agentic Rollout strategy allows the agent to continuously reason over retrieved off-policy step-level traces. This dynamically extends the agent's reasoning receptive field, enabling broader exploration conditioned on external behaviors. Subsequently, the Retrieval-aware Policy Optimization mechanism calibrates policy gradient estimation with a retrieval reward and importance shaping, stabilizing training and prioritizing retrieval-illuminating exploration. Extensive experiments show that RAPO achieves a +5.0% average gain on fourteen datasets across three agentic reasoning tasks, while training 1.2x faster.
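To make the idea of a hybrid-policy rollout concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify how traces are stored, when retrieval is triggered, or how candidates are selected, so everything here is an assumption for illustration: the names `StepTraceBuffer`, `hybrid_rollout`, `policy_step`, and `retrieve_prob` are hypothetical, retrieval is triggered with a fixed probability, and a trace is picked uniformly at random.

```python
import random
from dataclasses import dataclass, field


@dataclass
class StepTraceBuffer:
    """Hypothetical store of off-policy step-level traces, keyed by task."""
    traces: dict = field(default_factory=dict)

    def add(self, task_id, step_trace):
        # Append one step-level trace produced by some other (off-policy) model.
        self.traces.setdefault(task_id, []).append(step_trace)

    def retrieve(self, task_id, rng):
        # Assumption: uniform random choice; the paper's criterion may differ.
        candidates = self.traces.get(task_id, [])
        return rng.choice(candidates) if candidates else None


def hybrid_rollout(task_id, policy_step, buffer, max_steps, retrieve_prob, rng):
    """Run max_steps reasoning steps, optionally conditioning on retrieved traces.

    Before each on-policy step, a retrieved off-policy trace is spliced into
    the context with probability retrieve_prob (an assumed trigger), so the
    agent's next step is generated conditioned on external behavior.
    """
    context = []
    for _ in range(max_steps):
        trace = None
        if rng.random() < retrieve_prob:
            trace = buffer.retrieve(task_id, rng)
        if trace is not None:
            context.append(("off_policy", trace))
        # The on-policy agent continues reasoning over the full context.
        context.append(("on_policy", policy_step(context)))
    return context
```

Tagging each step with its source, as the sketch does, is one way a later optimization phase could treat on-policy and retrieved steps differently, for example when applying a retrieval reward or importance shaping.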

Paper Structure (56 sections, 15 equations, 14 figures, 8 tables, 1 algorithm)

This paper contains 56 sections, 15 equations, 14 figures, 8 tables, 1 algorithm.

Introduction
Related Work
Agentic Reinforcement Learning
On- and Off-Policy RL for LLM Training
Entropy-Related RL for LLM Training
Preliminaries
Problem Definition
Entropy Computation for Step-level Traces
Methodology
Hybrid-policy Agentic Rollout
Step-Trace Buffer
Retrieval from Step-Trace Buffer
Rollout Procedure
Retrieval-aware Policy Optimization
Retrieval Reward
...and 41 more sections

Figures (14)

Figure 1: Comparison between existing methods and our framework. (a) Existing Agentic RL methods are inherently on-policy, resulting in a limited exploration space bounded by the native agent. (b) Off-policy-enhanced RL methods statically integrate full off-policy trajectories for trajectory-level policy estimation, failing to capture the dynamic, step-level exploration within agentic rollout. (c) Our RAPO introduces retrieval and allows the on-policy agent to continuously reason over the retrieved off-policy step-level traces, explicitly expanding its reasoning receptive field for exploration and thus increasing rollout diversity.
Figure 2: Overview of RAPO. RAPO introduces a Hybrid-policy Agentic Rollout strategy that supports off-policy-conditioned reasoning: the agent receives retrieved off-policy traces, which broadens exploration beyond its intrinsic reasoning behaviors. Meanwhile, it incorporates a Retrieval-aware Policy Optimization mechanism with retrieval reward and importance shaping, ensuring effective and stable policy gradient estimation during training.
Figure 3: Efficiency study. RAPO exhibits clear training efficiency in rollout time, policy update time, the number of rollout tokens, and the number of tool calls.
Figure 4: Comparison analysis between pure on-policy rollouts and hybrid-policy rollouts.
Figure 5: Robustness study for different off-policy models.
...and 9 more figures
