OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data

Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, Siheng Chen

Abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet the development of high-performance search agents remains dominated by industrial giants due to a lack of transparent, high-quality training data. This persistent data scarcity has fundamentally hindered the broader research community from developing and innovating within this domain. To bridge this gap, we introduce OpenSeeker, the first fully open-source search agent (i.e., model and data) that achieves frontier-level performance through two core technical innovations: (1) fact-grounded, scalable, controllable QA synthesis, which reverse-engineers the web graph via topological expansion and entity obfuscation to generate complex, multi-hop reasoning tasks with controllable coverage and complexity; and (2) denoised trajectory synthesis, which employs a retrospective summarization mechanism to denoise the trajectory, thereby prompting the teacher LLMs to generate high-quality actions. Experimental results demonstrate that OpenSeeker, trained in a single training run on only 11.7k synthesized samples, achieves state-of-the-art performance across multiple benchmarks including BrowseComp, BrowseComp-ZH, xbench-DeepSearch, and WideSearch. Notably, trained with simple SFT, OpenSeeker significantly outperforms the second-best fully open-source agent DeepDive (e.g., 29.5% vs. 15.3% on BrowseComp), and even surpasses industrial competitors such as Tongyi DeepResearch (trained via extensive continual pre-training, SFT, and RL) on BrowseComp-ZH (48.4% vs. 46.7%). We fully open-source the complete training dataset and the model weights to democratize frontier search agent research and foster a more transparent, collaborative ecosystem.

Paper Structure (18 sections, 2 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: OpenSeeker stands out as the only fully open-source agent that achieves competitive performance on four search benchmarks, and it does so via simple SFT in a single training run.
  • Figure 2: Overview of fact-grounded, scalable, controllable QA synthesis. The pipeline begins with Graph Expansion, where a seed node is expanded into a subgraph of connected pages. Entity Extraction then distills key information themes into a structured Entity Subgraph. A generator synthesizes complex initial questions conditioned on this structure (Question Generation), ensuring that they require multi-hop reasoning. To increase difficulty, we apply Entity Obfuscation to make specific terms vague, finally producing a challenging question that can only be solved through deep graph traversal.
  • Figure 3: Overview of Denoised Trajectory Synthesis. We employ a retrospective summarization mechanism where, after each tool call, the raw tool response from the previous turn is condensed into a 'Summarized Response' that replaces the original raw tool response in the history window. This cleaner context enables the teacher to generate high-quality reasoning and actions. Note the asymmetry: while synthesis relies on summarized context, the training and inference phases operate on raw tool responses, forcing the model to learn intrinsic denoising capabilities.
  • Figure 4: Comparison of difficulty between OpenSeeker-v1-Data-ZH and BrowseComp-ZH using the same model for inference. OpenSeeker-v1-Data-ZH exhibits significantly higher average token counts and tool call counts than BrowseComp-ZH.
  • Figure 5: Comparison of difficulty between OpenSeeker-v1-Data-EN and BrowseComp-EN using the same model for inference. OpenSeeker-v1-Data-EN exhibits difficulty comparable to that of BrowseComp-EN.
  • ...and 1 more figure
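
The retrospective summarization described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Turn`, `synthesize_trajectory`, and `to_training_example`, the `answer:` prefix used to signal termination, and the callables `teacher`, `tool`, and `summarize` are all hypothetical placeholders. The sketch also assumes that every tool response is summarized before the teacher's next step; the caption does not state whether the most recent response stays raw.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    action: str        # tool call chosen by the teacher (hypothetical string format)
    raw_response: str  # unmodified tool output, kept for training
    summary: str = ""  # condensed response, filled in retrospectively


def synthesize_trajectory(
    question: str,
    teacher: Callable[[str, List[Tuple[str, str]]], str],
    tool: Callable[[str], str],
    summarize: Callable[[str], str],
    max_turns: int = 8,
) -> Tuple[List[Turn], Optional[str]]:
    """Roll out a teacher trajectory over a summarized (denoised) history."""
    turns: List[Turn] = []
    for _ in range(max_turns):
        # Retrospective step: condense the previous turn's raw response.
        if turns and not turns[-1].summary:
            turns[-1].summary = summarize(turns[-1].raw_response)
        # Assumption: the teacher only ever sees summaries, never raw output.
        context = [(t.action, t.summary) for t in turns]
        action = teacher(question, context)
        if action.startswith("answer:"):  # hypothetical stop signal
            return turns, action
        turns.append(Turn(action=action, raw_response=tool(action)))
    return turns, None  # turn budget exhausted without an answer


def to_training_example(question: str, turns: List[Turn], final: str) -> dict:
    """Build the SFT record from RAW responses, mirroring the asymmetry."""
    history = [(t.action, t.raw_response) for t in turns]
    return {"question": question, "history": history, "answer": final}
```

The point of the split is that the teacher's actions are chosen under a clean context, while the stored training record pairs those same actions with the noisy raw responses, so the student has to learn to denoise on its own.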