RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation

Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang

TL;DR

The paper addresses knowledge-intensive QA by optimizing agentic RAG through a unified framework, RAG-Gym. It introduces Re$^2$Search with reasoning reflection and analyzes three optimization dimensions: prompt engineering, actor tuning, and critic training. Through extensive experiments across four datasets, it demonstrates that process-level supervision and critic-guided inference yield substantial and generalizable gains, culminating in the Re$^2$Search++ agent that outperforms several RL-based outcome-supervision methods, especially on unseen data. The work also offers practical insights on reward sources and scaling properties, providing a foundation for robust, scalable agentic RAG deployments.

Abstract

Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and recently advanced with agentic RAG, where language agents engage in multi-round interactions with external knowledge sources for adaptive information retrieval. However, existing agentic RAG methods often depend on ad-hoc prompt engineering and lack a unified optimization framework. We introduce RAG-Gym, a comprehensive platform that systematically explores three optimization dimensions: (1) prompt engineering, (2) actor tuning, and (3) critic training. For prompt engineering, we propose Re$^2$Search, a novel agent incorporating reasoning reflection that significantly outperforms standard prompts. In actor tuning, we evaluate three popular post-training algorithms with fine-grained process supervision and identify direct preference optimization as the most effective. We further demonstrate that a trained critic can enhance inference by selecting higher-quality intermediate reasoning steps. Together, these findings lead to the optimized Re$^2$Search++ agent, which surpasses most recent methods like Search-R1 by a relative increase of 3.2% to 11.6% in average F1. Finally, we examine the impact of different reward sources and analyze scaling properties in training and inference, offering practical insights for agentic RAG optimization. The project homepage is available at https://rag-gym.github.io.
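
The critic-guided inference mentioned in the abstract can be illustrated with a minimal sketch: sample several candidate actions for the current state and keep the one a critic scores highest. The names `select_action`, `propose`, `critic`, and `num_candidates` are hypothetical and chosen for illustration only. The sketch assumes the actor and the critic are plain callables and is not taken from the RAG-Gym codebase.

```python
from typing import Callable


def select_action(
    state: str,
    propose: Callable[[str], str],
    critic: Callable[[str, str], float],
    num_candidates: int = 4,
) -> str:
    """Sample candidate actions and return the one the critic scores highest.

    `propose` stands in for the actor (an LLM sampled with temperature > 0),
    and `critic` stands in for a trained process reward model. Both are
    assumptions made for this illustration.
    """
    # Draw several independent candidates for the same state.
    candidates = [propose(state) for _ in range(num_candidates)]
    # Keep the candidate with the highest critic score (first one wins ties).
    return max(candidates, key=lambda action: critic(state, action))
```

In this form the critic only reranks candidates at inference time, so the actor itself is left unchanged. Sampling more candidates per step costs additional actor calls.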

Paper Structure

This paper contains 44 sections, 1 equation, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of the RAG-Gym framework. RAG-Gym employs a modular design, comprising prompt engineering, actor tuning, and critic training, to systematically optimize agentic RAG performance. By leveraging all three components, RAG-Gym improves the F1 score of the ReAct agent on HotpotQA from 41.09% to 60.19%.
  • Figure 2: Performance improvements across various agents with critics.
  • Figure 3: Performance of Re$^2$Search agents with critics trained on different numbers of samples.
  • Figure 4: Performance of Re$^2$Search agents with different numbers of actions sampled per step.
  • Figure 5: Pipeline of the process data collection in RAG-Gym. Process reward data is collected by randomly sampling action candidates at each time step and using an external annotator (e.g., GPT-4o) to select the best one. The episode is terminated when the agent generates a final answer.
  • ...and 5 more figures
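
The process data collection described in the Figure 5 caption can be sketched roughly as follows: at each step several action candidates are sampled, an external annotator picks the best one, and the episode ends once the chosen action is a final answer. All names here (`Step`, `collect_episode`, `propose`, `annotate`, `is_final`, `max_steps`) are hypothetical. The sketch also simplifies the state to a text history and omits the retrieval call that would follow a search action, so it should be read as an illustration under those assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    state: str             # history visible to the agent at this step
    candidates: List[str]  # sampled action candidates
    chosen: str            # candidate selected by the annotator


def collect_episode(
    question: str,
    propose: Callable[[str], str],
    annotate: Callable[[str, List[str]], str],
    is_final: Callable[[str], bool],
    num_candidates: int = 3,
    max_steps: int = 10,  # assumed safety cap, not stated in the source
) -> List[Step]:
    """Collect one episode of process-level preference data."""
    history = question
    steps: List[Step] = []
    for _ in range(max_steps):
        # Sample several candidate actions for the current state.
        candidates = [propose(history) for _ in range(num_candidates)]
        # An external annotator (e.g., GPT-4o in the caption) selects the best one.
        best = annotate(history, candidates)
        steps.append(Step(history, candidates, best))
        # The episode terminates when the chosen action is a final answer.
        if is_final(best):
            break
        # Simplification: append the action only; retrieved documents are omitted.
        history = history + "\n" + best
    return steps
```

Each recorded `Step` pairs the chosen action with the candidates that were not selected, which is the kind of per-step comparison that process supervision relies on.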