
ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

Nandan Thakur, Zijian Chen, Xueguang Ma, Jimmy Lin

Abstract

Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, which involve multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset of 20K reasoning-intensive queries with short, verifiable answers, generated using a frugal framework that does not rely on paid API services. The modular framework consists of four stages: seed creation, question-answer pair generation, and two stages of verification, self-verification and external verification. ORBIT spans 15 domains, each training pair requires 4-5 reasoning steps, and external verification is performed against the open web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question-answering tasks. Extensive experimental results show that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code, and datasets are open-sourced and publicly available.
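The four-stage pipeline described above can be sketched as a simple generate-then-filter loop. The sketch below is a minimal illustration, not the authors' implementation: all function names are hypothetical, and the LM calls (question generation, DeepSeek Chat self-verification) and web scraping are replaced by stubs.

```python
# Hypothetical sketch of ORBIT's four-stage data generation loop.
# Real LM prompting and web scraping are stubbed out for illustration.

def create_seed(domain):
    # Stage 1: sample a seed topic/entity for the given domain (stub).
    return f"seed entity for {domain}"

def generate_qa(seed, exemplars):
    # Stage 2: prompt an LM with the seed as inspiration plus shuffled
    # exemplars to produce a multi-hop question with a short answer (stub).
    return {"question": f"Which ... (inspired by {seed})?", "answer": "X"}

def self_verify(pair):
    # Stage 3: a verifier LM checks the pair; here we only keep pairs
    # with a non-empty, short (verifiable) answer.
    return bool(pair["answer"]) and len(pair["answer"].split()) <= 5

def external_verify(pair, scraped_pages):
    # Stage 4: confirm the answer is supported by scraped web page text.
    return any(pair["answer"] in page for page in scraped_pages)

def build_dataset(domains, exemplars, scraped_pages):
    dataset = []
    for domain in domains:
        pair = generate_qa(create_seed(domain), exemplars)
        if self_verify(pair) and external_verify(pair, scraped_pages):
            dataset.append(pair)
    return dataset
```

Only pairs that survive both verification stages enter the training set, which is what keeps the 20K queries short-answer and externally checkable.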

Paper Structure

This paper contains 45 sections, 8 figures, and 11 tables.

Figures (8)

  • Figure 1: An end-to-end data instance in the ORBIT dataset generated using our framework. The procedure involves the following stages: (1) seed creation, (2) question--answer pair generation, (3) self-verification with DeepSeek Chat, and (4) external verification with scraped web page URLs. Each stage is described in detail in Section \ref{sec:dataset-creation}.
  • Figure 2: Validation EM accuracy of Search-R1-4B, InfoSeeker-4B, and ORBIT-4B over 160 training steps on Wikipedia datasets, each with 125 randomly sampled validation pairs. The accuracy drops observed during training are primarily due to the DDGS web search retriever, which can downgrade search results when servers are busy (e.g., google$\rightarrow$bing).
  • Figure 3: Prompt template used for search agent training in ORBIT. The agent interleaves <think> reasoning </think> blocks with <search>query</search> calls; retrieved passages are returned as <information> documents </information>, and the trajectory terminates when the model emits a final <answer>answer</answer>.
  • Figure 4: Prompt template for question--answer pair generation for a given input seed as inspiration and shuffled exemplars.
  • Figure 5: Prompt template for self-verification given the input question and answer.
  • ...and 3 more figures
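The trajectory format from Figure 3, where the agent interleaves <think>, <search>, <information>, and <answer> blocks and terminates on a final answer, can be handled with a small parser. This is an illustrative sketch under the assumption that blocks are well-formed and non-nested; it is not the authors' code.

```python
import re

# Matches one well-formed, non-nested tagged block from the agent
# trajectory format: <think>...</think>, <search>...</search>,
# <information>...</information>, or <answer>...</answer>.
TAG_RE = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)

def parse_trajectory(text):
    """Return the ordered list of (tag, content) segments in a rollout."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_RE.finditer(text)]

def final_answer(text):
    """Extract the terminating <answer> block, or None if absent."""
    answers = [c for tag, c in parse_trajectory(text) if tag == "answer"]
    return answers[-1] if answers else None
```

Such a parser is useful both for extracting <search> queries to dispatch to the retriever mid-rollout and for pulling out the final <answer> to score EM accuracy against the gold short answer.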