DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lu Hou, Lifeng Shang

TL;DR

Open-domain information seeking is challenging for LLMs. The authors introduce WebPuzzle (24k training, 275 test) and DeepDiver, a cold-start SFT plus RL framework that learns adaptive search intensity scaling (SIS). Empirical results show SIS yields search-heavy, verifiable answers and enables 7B models to approach DeepSeek-R1's performance on real web tasks; DeepDiver generalizes to open-ended problems and Wiki-based tasks alike. The work also provides a rigorous benchmark and insights into reward design, generalization, and emergent search behaviors for future research.

Abstract

Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain fixed by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce WebPuzzle, a benchmark with 24k training samples and 275 test samples that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement-learning (RL) framework that cultivates Search Intensity Scaling (SIS), an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a well-designed RL procedure, and show that its seeking policy generalizes from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.

Paper Structure

This paper contains 63 sections, 7 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Illustration of four key information-seeking behaviors: (a) Evidence Gathering & Supplements (b) Conflict Resolution (c) Verification & Denoising and (d) Reflection & Correction.
  • Figure 2: WebPuzzle pipeline. Above: Candidate Generation: Wiki and open-web pages yield QA pairs via (i) Cross-Page QA and (ii) Riddle pipelines, grouped as Cross-Page QA, Open Riddle, and Wiki Riddle. Below: Difficulty Tagging: Each sample is tagged (easy/medium/hard) for adaptive mixing in RL; DeepDiver is trained on a curated 7k-sample mix.
  • Figure 3: DeepDiver overview. (a) Rollout Generation: DeepDiver iteratively reasons, retrieves evidence, and answers WebPuzzle queries, then receives rewards based on comparison with ground truth. (b) RL Updates: Retrieved text is masked during loss calculation, and the LLM is refined via GRPO using advantages $A_i$ derived from rewards $r_i$.
  • Figure 4: Correlation between reward value and the number of search calls across training phases. The increase in the number of search engine calls is accompanied by a rise in training rewards.
  • Figure 5: The comparison after removing cases answered correctly through internal knowledge.
  • ...and 11 more figures
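
The RL update summarized in the Figure 3 caption (advantages $A_i$ derived from rewards $r_i$, with retrieved text masked out of the loss) can be illustrated with a small sketch. It assumes the commonly used GRPO normalization, $A_i = (r_i - \mathrm{mean}(r)) / (\mathrm{std}(r) + \epsilon)$ over a group of rollouts for the same query; the paper's exact variant, reward values, and loss are not given in this section. The function names, the `eps` constant, and the choice of population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style).

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_token_loss(token_losses, is_retrieved):
    """Average per-token loss over model-generated tokens only.

    Tokens flagged as retrieved text are excluded, so the policy is not
    trained to reproduce search results it did not write.
    """
    kept = [loss for loss, masked in zip(token_losses, is_retrieved) if not masked]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical group of four rollouts for one query: two correct, two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical per-token losses; the middle token came from a search result.
loss = masked_token_loss([0.5, 2.0, 1.5], [False, True, False])
```

In this toy group, correct rollouts receive positive advantages and incorrect ones negative advantages of equal magnitude, and the masked token contributes nothing to the averaged loss.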