Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

TL;DR

This work introduces HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning, and HDS-QA, a novel dataset that guides models to combine broad parallel search with structured aggregation through supervised reasoning-query-retrieval trajectories containing parallel sub-queries.

Abstract

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, we find that existing approaches rarely demonstrate test-time search scaling. Methods that extend reasoning through single-query sequential search suffer from limited evidence coverage, while approaches that generate multiple independent queries per step often lack structured aggregation, hindering deeper sequential reasoning. We propose a hybrid search strategy to address these limitations. We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning. To supervise this behavior, we introduce HDS-QA, a novel dataset that guides models to combine broad parallel search with structured aggregation through supervised reasoning-query-retrieval trajectories containing parallel sub-queries. Across five benchmarks, HybridDeepSearcher significantly outperforms the state-of-the-art, improving F1 scores by +15.9 on FanOutQA and +9.2 on a subset of BrowseComp. Further analysis shows its consistent test-time search scaling: performance improves as additional search turns or calls are allowed, while competing methods plateau.
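The hybrid strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the callables `propose_queries`, `search`, `aggregate`, and `try_answer` are hypothetical placeholders for the model and the retriever, and the `max_turns` / `max_calls` budgets are assumed names for a cap on sequential turns and on total search calls. The sketch only shows the control flow: issue several sub-queries in parallel within a turn, aggregate their evidence into one note, and only then move on to the next sequential turn.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple


def hybrid_search(
    question: str,
    propose_queries: Callable[[str, List[str]], List[str]],  # hypothetical: model proposes sub-queries
    search: Callable[[str], str],                            # hypothetical: one retrieval call
    aggregate: Callable[[List[str], List[str]], str],        # hypothetical: merge evidence into a note
    try_answer: Callable[[str, List[str]], Optional[str]],   # hypothetical: answer or None
    max_turns: int = 4,
    max_calls: int = 8,
) -> Tuple[Optional[str], int, int]:
    """Return (answer, turns_used, calls_used) under assumed turn/call budgets."""
    notes: List[str] = []
    turns_used = 0
    calls_used = 0
    while turns_used < max_turns and calls_used < max_calls:
        # Parallel step: several sub-queries in one turn, trimmed to the remaining call budget.
        queries = propose_queries(question, notes)[: max_calls - calls_used]
        if not queries:
            break
        with ThreadPoolExecutor(max_workers=len(queries)) as pool:
            results = list(pool.map(search, queries))
        calls_used += len(queries)
        turns_used += 1
        # Aggregation step: condense this turn's evidence before reasoning further.
        notes.append(aggregate(queries, results))
        # Sequential step: either answer now or continue to a deeper turn.
        answer = try_answer(question, notes)
        if answer is not None:
            return answer, turns_used, calls_used
    # Budget exhausted without an answer; a real agent would be prompted to answer here.
    return None, turns_used, calls_used
```

Separating the two budgets reflects that turns are sequential (latency) while calls within a turn run concurrently (cost), so a turn with many sub-queries costs more calls but no extra turns.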

Paper Structure

This paper contains 48 sections, 1 equation, 6 figures, 19 tables.

Figures (6)

  • Figure 1: Test-time Search Scaling on BrowseComp$^\dagger$. For our method, evaluation is conducted by scaling two types of search resources: (1) latency, measured by the maximum number of search turns ($M_T = 1, 2, 4, 8$), and (2) search cost, measured by the maximum number of search calls ($M_C = 2, 4, 8, 16$). The x-axes report the average number of turns/calls actually used under each budget. Our model is required to output a final answer once either resource limit is exhausted. For the baselines, we allow a maximum of 10 turns with no limit on API calls. The results on the other benchmarks are provided in the appendix (experimental details).
  • Figure 2: Pipeline for HDS-QA question generation.
  • Figure 3: Trade-off between effectiveness and efficiency. We compare mean Acc scores by the number of search turns (upper) and search API calls (lower). At each turn or API call, we compute the mean Acc scores across all datapoints, assigning a score of 0 if unanswered within the allowed turns or calls.
  • Figure 4: Acc grouped by the number of gold evidence on MuSiQue, FanOutQA, and FRAMES.
  • Figure 5: Test-Time Search Scaling results: (a) number of turns and (b) number of API calls.
  • ...and 1 more figures
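
The scoring rule in the Figure 3 caption (mean Acc at each budget, with a score of 0 for any datapoint not answered within that budget) can be written out as a short sketch. The function name and the input layout are assumptions for illustration: `answered_at[i]` is the number of turns (or calls) datapoint `i` used before answering, or `None` if it never answered, and `scores[i]` is its Acc score.

```python
from typing import List, Optional, Sequence


def accuracy_by_budget(
    answered_at: Sequence[Optional[int]],
    scores: Sequence[float],
    budgets: Sequence[int],
) -> List[float]:
    """Mean score per budget; unanswered-within-budget datapoints count as 0."""
    if len(answered_at) != len(scores):
        raise ValueError("answered_at and scores must have the same length")
    n = len(scores)
    curve: List[float] = []
    for budget in budgets:
        total = 0.0
        for used, score in zip(answered_at, scores):
            # Only datapoints answered within the budget contribute their score.
            if used is not None and used <= budget:
                total += score
        # The denominator is always the full dataset, not just the answered subset.
        curve.append(total / n if n else 0.0)
    return curve


# Four datapoints: answered after 1, 3, never, and 2 turns respectively.
print(accuracy_by_budget([1, 3, None, 2], [1.0, 1.0, 1.0, 0.0], [1, 2, 4]))
```

Dividing by the full dataset size rather than by the number of answered datapoints is what makes the curve reflect both effectiveness and efficiency: a method that answers late scores 0 at small budgets.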