Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu

TL;DR

The paper presents ASearcher, an open-source framework for large-scale, fully asynchronous agentic RL training of search agents. It combines long-horizon trajectory learning with a QA-data synthesis pipeline to produce challenging, grounded QA pairs, enabling agents to perform complex, multi-turn search and summarization without external LLMs. Empirical results show strong improvements on GAIA, xBench-DeepSearch, and Frames, with the Web-QwQ variant achieving state-of-the-art open-source performance and competitive results with commercial systems under test-time scaling. By releasing models, data, and code, the work aims to democratize access to scalable, high-quality open-source search agents and spur broader adoption of agentic RL for real-world tasks.

Abstract

Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among the diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence: the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, the small turn limits in existing online RL methods (e.g., ≤10 turns) restrict the learning of complex strategies. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable, fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 78.0% and 34.3% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 100 turns and output tokens exceeding 400k during training. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 51.1 on xBench and 58.7 on GAIA, surpassing existing open-source 32B agents. Finally, we show that, when using an external summary tool in a zero-shot transfer manner together with test-time search, ASearcher-Web-QwQ can reach the performance of commercial systems. We open-source our models, training data, and code at https://github.com/inclusionAI/ASearcher.
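
The Avg@4 scores and the percentage gains quoted in the abstract can be related with a small calculation. The sketch below assumes that Avg@4 is the accuracy averaged over four independent attempts per question (the abstract does not define it) and that the percentage gains are measured relative to the pre-RL score. The absolute improvements (+22.4 on xBench, +15.0 on GAIA) are taken from the Figure 1 caption. The function names are illustrative and are not part of the ASearcher codebase.

```python
from statistics import mean


def avg_at_k(outcomes: list[list[bool]], k: int = 4) -> float:
    """Accuracy in percent, averaged over k attempts per question.

    Assumption: Avg@k is the per-question mean correctness over k
    independent attempts, averaged over all questions.
    """
    if not outcomes:
        raise ValueError("need at least one question")
    per_question = []
    for attempts in outcomes:
        if len(attempts) != k:
            raise ValueError(f"expected {k} attempts, got {len(attempts)}")
        per_question.append(sum(attempts) / k)
    return 100.0 * mean(per_question)


def relative_gain_percent(final_score: float, absolute_gain: float) -> float:
    """Relative improvement over the pre-RL score, in percent."""
    baseline = final_score - absolute_gain
    return 100.0 * absolute_gain / baseline


# Toy example (made-up outcomes): two questions, four attempts each.
toy = [[True, True, False, False], [True, False, False, False]]
print(avg_at_k(toy))  # 37.5

# Reported final Avg@4 scores (abstract) and absolute gains (Figure 1).
print(round(relative_gain_percent(51.1, 22.4), 1))  # xBench: 78.0
print(round(relative_gain_percent(58.7, 15.0), 1))  # GAIA: 34.3
```

Under these assumptions the two results reproduce the 78.0% and 34.3% figures, which suggests that the gains in the abstract are relative rather than absolute improvements.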

Paper Structure

This paper contains 57 sections, 1 equation, 12 figures, 5 tables.

Figures (12)

  • Figure 1: (Left) Asynchronous RL brings substantial improvements: Through RL training, our agent, ASearcher-Web-QwQ, obtains +15.0, +22.4, and +15.6 improvements on GAIA, xBench, and Frames, respectively. (Middle) & (Right) Through RL training, ASearcher-Web-QwQ learns to conduct long-horizon search, with tool calls exceeding 100 turns and output tokens exceeding 400k during training. The agent also learns expert-level search strategies (see the case study in Sec. \ref{sec:limit-of-current}).
  • Figure 2: Comparison between ASearcher and Search-R1. (Left) Search-R1 is only equipped with search tools and lacks web browsing capability. (Right) ASearcher uses a simple agent design with two basic tools, search and browsing, without relying on any external LLM. ASearcher is a comprehensive agent capable of both reasoning and summarizing lengthy web content. Notably, both reasoning and summarization abilities are optimized through end-to-end RL training.
  • Figure 3: A case study on a complex query from GAIA. Search-R1-32B is unable to break down the complex question and suffers from severe hallucinations. Search-o1 (QwQ) can identify the correct articles through extensive tool calls, but easily misses key information and fails to verify wrong conclusions. Our end-to-end RL agent, ASearcher-Web-QwQ, exhibits key behaviors featuring Search Intelligence: uncertainty-aware reasoning (listing and examining candidate answers), precise extraction from noisy content, cross-document inference, and grounded verification.
  • Figure 4: Data Synthesis Agent. Starting from a seed QA, the data synthesis agent iteratively modifies the question through two actions, Injection and Fuzz. Through Injection, the agent enriches the question by adding external facts. Through Fuzz, the agent blurs certain information to increase uncertainty and difficulty. The facts related to the question are tracked during the synthesis process. Each time the question is modified, a quality verification step is applied to ensure the quality and difficulty of the synthetic questions.
  • Figure 5: Statistics from our data synthesis process. (Left) The distribution of the number of supporting facts. (Middle) The distribution of the number of fuzz actions and injection actions. (Right) The accuracy distribution of QwQ-32B in answering the generated questions without using any tools.
  • ...and 7 more figures
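
The synthesis loop described in the Figure 4 caption (seed QA, Injection or Fuzz, fact tracking, quality verification) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: in the paper the actions and the verifier are carried out by a prompt-based LLM agent, whereas here `toy_inject`, `toy_fuzz`, and `toy_verify` are hypothetical string-based stand-ins, and `SyntheticQA`, `synthesize`, and `max_steps` are names invented for this example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class SyntheticQA:
    question: str
    answer: str
    supporting_facts: tuple[str, ...] = field(default_factory=tuple)


Action = Callable[[SyntheticQA], SyntheticQA]
Verifier = Callable[[SyntheticQA, SyntheticQA], bool]


def synthesize(seed: SyntheticQA, actions: list[Action], verify: Verifier,
               max_steps: int = 4, rng_seed: int = 0) -> SyntheticQA:
    """Iteratively modify a seed QA; keep a change only if it passes verification."""
    rng = random.Random(rng_seed)
    current = seed
    for _ in range(max_steps):
        candidate = rng.choice(actions)(current)
        if verify(current, candidate):  # quality check after every modification
            current = candidate
    return current


def toy_inject(qa: SyntheticQA) -> SyntheticQA:
    """Hypothetical Injection: replace an entity with an external fact about it."""
    fact = "The Eiffel Tower was completed in 1889."
    question = qa.question.replace("the Eiffel Tower",
                                   "the tower completed in 1889")
    return SyntheticQA(question, qa.answer, qa.supporting_facts + (fact,))


def toy_fuzz(qa: SyntheticQA) -> SyntheticQA:
    """Hypothetical Fuzz: blur a precise detail to raise uncertainty."""
    question = qa.question.replace("in 1889", "in the late 1880s")
    return SyntheticQA(question, qa.answer, qa.supporting_facts)


def toy_verify(previous: SyntheticQA, candidate: SyntheticQA) -> bool:
    """Hypothetical check: the question must change and the answer must not."""
    return (candidate.answer == previous.answer
            and candidate.question != previous.question)


seed = SyntheticQA("Which city is the Eiffel Tower located in?", "Paris")
result = synthesize(seed, [toy_inject, toy_fuzz], toy_verify)
print(result.question)
print(result.supporting_facts)
```

Rejecting a candidate leaves the current question untouched, so a failed modification never degrades the QA pair. The answer stays fixed throughout, while the tracked facts grow with each accepted Injection.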