WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents

Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He

TL;DR

WebExplorer tackles the data scarcity barrier in long-horizon web agents by autonomously synthesizing challenging QA pairs through model-based exploration and iterative query evolution. The two-stage training (supervised fine-tuning followed by GRPO reinforcement learning) produces an 8B model that achieves state-of-the-art performance at its scale on multiple information-seeking benchmarks, while generalizing to tasks beyond information seeking, such as HLE. Key contributions include the WebExplorer-QA dataset (~40K evolved QA pairs), a two-tool (search, browse) scaffold within a ReAct framework, and a training regime that scales context length to 128K with up to 100 tool calls. The results demonstrate the practical viability of long-horizon web agents and offer a pathway for open-source development of more capable autonomous information seekers.

Abstract

The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we develop an advanced web agent, WebExplorer-8B, through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool-calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to search effectively over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is trained only on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
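
To make the two-tool scaffold concrete, here is a minimal sketch of a ReAct-style loop with `search` and `browse` tools and a cap of 100 tool calls. It is an illustration only, not the authors' implementation: the tool stubs, the `policy` callable (standing in for the model), the `(kind, argument)` action format, and the history strings are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

MAX_TOOL_CALLS = 100  # turn budget stated in the abstract

# Hypothetical tool stubs; a real agent would call a search API and a page reader.
def search(query: str) -> str:
    return f"[search results for: {query}]"

def browse(url: str) -> str:
    return f"[page content of: {url}]"

TOOLS: Dict[str, Callable[[str], str]] = {"search": search, "browse": browse}

# Assumed action format: (kind, argument), kind in {"search", "browse", "answer"}.
Action = Tuple[str, str]
Policy = Callable[[List[str]], Action]

def run_agent(question: str, policy: Policy,
              max_tool_calls: int = MAX_TOOL_CALLS) -> Tuple[Optional[str], int]:
    """Alternate model actions and tool observations until an answer or the budget."""
    history: List[str] = [f"Question: {question}"]
    calls = 0
    while calls < max_tool_calls:
        kind, arg = policy(history)          # the model decides the next step
        if kind == "answer":
            return arg, calls                # final answer ends the episode
        if kind not in TOOLS:
            raise ValueError(f"unknown tool: {kind!r}")
        observation = TOOLS[kind](arg)       # execute the chosen tool
        history.append(f"Action: {kind}({arg})")
        history.append(f"Observation: {observation}")
        calls += 1                           # each search or browse counts once
    return None, calls                       # budget exhausted without an answer

# A scripted stand-in for the model: search once, browse once, then answer.
def scripted_policy(history: List[str]) -> Action:
    n_obs = sum(1 for h in history if h.startswith("Observation:"))
    if n_obs == 0:
        return ("search", "seed entity")
    if n_obs == 1:
        return ("browse", "https://example.com")
    return ("answer", "final answer")

if __name__ == "__main__":
    print(run_agent("Example question?", scripted_policy))
```

Returning `None` when the budget runs out keeps the cap explicit; how the actual system handles an exhausted budget is not stated in this section.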

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Performance comparison on BrowseComp-en, BrowseComp-zh and HLE benchmarks across different models.
  • Figure 2: Model-Based Exploration and Iterative Query Evolution Framework. Starting from a seed entity (e.g., David Hackett Souter), the framework employs iterative search and browsing actions to construct the information space related to the seed entity. Initial queries ($Q_0$) and answers are generated based on this explored information space. Through iterative evolution, salient information is systematically obfuscated (e.g., "Remove Birth...", "Replace ..." or "Vague Date...") to produce more challenging queries ($Q_1$ to $Q_n$). This process ensures that the resulting queries require longer reasoning chains and more exploration.
  • Figure 3: Illustration of model-based exploration and initial query-answer pair synthesis. Starting from the seed "Brazil National Team", the model iteratively explores using ●S (Search) and ●B (Browse) actions to discover interconnected facts, then synthesizes a challenging query-answer pair that requires deep reasoning across multiple discovered connections.
  • Figure 4: Tool calling turns distribution comparisons using OpenAI-o3: Initial QA vs Evolved QA (left) and Evolved QA vs BrowseComp-en (right).
  • Figure 5: Left: Average number of tool calls per trajectory during RL training. Each tool call (search or browse) is counted separately. Middle: Average trajectory length (in tokens) during RL training. Right: The avg@4 scores on BrowseComp-en and BrowseComp-zh during RL training.
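
The Figure 5 caption reports avg@4 scores without defining the metric. The sketch below assumes the common reading: each question is attempted in four independent runs, per-question accuracy is the fraction of correct runs, and the score is the mean over questions. The function name `avg_at_k` and the boolean input layout are assumptions for illustration, not taken from the paper.

```python
from statistics import mean
from typing import List

def avg_at_k(correct: List[List[bool]]) -> float:
    """Mean over questions of the fraction of correct runs.

    correct[i][j] is True if run j on question i produced the right answer.
    For avg@4, each inner list is assumed to hold four entries.
    """
    per_question = [mean([1.0 if c else 0.0 for c in runs]) for runs in correct]
    return mean(per_question)

if __name__ == "__main__":
    # Two questions, four runs each: 2/4 correct and 4/4 correct.
    runs = [
        [True, False, False, True],
        [True, True, True, True],
    ]
    print(avg_at_k(runs))  # 0.75
```

Averaging over several runs reduces the variance from sampling and from live web results, which is one reason to report it instead of a single pass.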