WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
TL;DR
The paper tackles the gap between open-source and proprietary agentic systems in complex information-seeking, identifying two key bottlenecks: the diversity of uncertainty represented in the training data and the scalability of training. It introduces SailorFog-QA-V2, which generates dense, cycle-rich knowledge graphs and a broader set of uncertainties, and details WebSailor-V2's post-training pipeline, which combines an SFT cold start with agentic RL in a simulated offline environment and a robust tool suite. Empirically, WebSailor-V2-30B-A3B achieves state-of-the-art results among open-source models on the BrowseComp benchmarks and competitive performance on HLE and the DeepResearch Bench, outperforming many larger models and other open-source rivals and narrowing the gap to proprietary systems. The work provides a practical, open-source blueprint that emphasizes data quality and training stability over algorithmic novelty and can guide future development of scalable, tool-enabled agents.
Abstract
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.
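
As a rough illustration of what "structured sampling and information obfuscation" could look like, the sketch below samples a small subgraph from a toy knowledge graph by random walk and then blurs precise values into vaguer clues. This is not the authors' pipeline: the random walk stands in for whatever sampling strategy the paper uses, and the graph contents, the function names (`sample_subgraph`, `obfuscate`, `make_clues`), and the year-to-decade rule are all hypothetical choices made for this example.

```python
import random

# Toy knowledge graph (hypothetical data): entity -> list of (relation, neighbor).
# The Ada -> Babbage -> Engine -> Ada path forms a cycle.
TOY_GRAPH = {
    "Ada Lovelace": [("collaborated_with", "Charles Babbage"), ("born_in", "1815")],
    "Charles Babbage": [("designed", "Analytical Engine"), ("born_in", "1791")],
    "Analytical Engine": [("described_by", "Ada Lovelace")],
    "1815": [],
    "1791": [],
}


def sample_subgraph(graph, start, num_edges, rng):
    """Collect up to num_edges distinct edges via a random walk from start.

    Assumption: a random walk with restarts is used here only as a simple
    stand-in for structured sampling.
    """
    edges = []
    node = start
    for _ in range(num_edges * 10):  # bounded number of attempts
        if len(edges) >= num_edges:
            break
        neighbors = graph.get(node, [])
        if not neighbors:
            node = start  # dead end: restart the walk
            continue
        relation, nxt = rng.choice(neighbors)
        edge = (node, relation, nxt)
        if edge not in edges:
            edges.append(edge)
        node = nxt
    return edges


def obfuscate(value):
    """Blur a precise value (illustrative rule: four-digit year -> decade)."""
    if value.isdigit() and len(value) == 4:
        decade = int(value) // 10 * 10
        return f"sometime in the {decade}s"
    return value  # other values are left unchanged in this sketch


def make_clues(edges):
    """Render sampled edges as clue strings with obfuscated tail values."""
    return [f"{head} --{rel}--> {obfuscate(tail)}" for head, rel, tail in edges]


if __name__ == "__main__":
    rng = random.Random(0)
    for clue in make_clues(sample_subgraph(TOY_GRAPH, "Ada Lovelace", 3, rng)):
        print(clue)
```

In a question built from such clues, the solver would have to narrow down the blurred values through search rather than match them directly, which is the kind of uncertainty reduction the abstract describes.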
