DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL

Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong

TL;DR

DeepDive tackles the gap between open LLMs and proprietary systems in deep search by coupling automated knowledge-graph–driven data synthesis with end-to-end multi-turn reinforcement learning. The approach generates challenging, multi-hop QA pairs from KGs and trains LLMs to reason across iterative searches while discouraging redundant queries. Empirical results show that DeepDive-32B achieves competitive performance among open-source models on BrowseComp, that RL training leads to deeper search strategies and better test-time tool use, and that semi-automated i.i.d. data brings additional gains. The work provides open-source datasets, models, and code, highlighting practical pathways to scale deep search via tool usage and diverse exploration, while acknowledging limitations relative to top proprietary systems.

Abstract

Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs' long-horizon reasoning with deep search. To encourage diversity and reduce redundancy, we design a redundancy penalty that discourages repeated similar queries. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
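
The abstract mentions a redundancy penalty that discourages repeated similar queries but does not spell out its form here. The following is a minimal sketch of one way such a penalty could be computed, not the authors' implementation. It assumes token-level Jaccard overlap as the similarity measure and a simple subtraction from the task reward; the names `jaccard`, `redundancy_penalty`, `shaped_reward`, and the `weight` parameter are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two query strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty queries are treated as identical
    return len(ta & tb) / len(ta | tb)


def redundancy_penalty(queries: list[str]) -> float:
    """Mean pairwise similarity over all search queries issued in one trajectory."""
    pairs = list(combinations(queries, 2))
    if not pairs:
        return 0.0  # zero or one query: nothing can be redundant
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def shaped_reward(task_reward: float, queries: list[str], weight: float = 0.1) -> float:
    """Task reward minus a weighted redundancy term (the weight is a hypothetical knob)."""
    return task_reward - weight * redundancy_penalty(queries)


if __name__ == "__main__":
    trajectory = [
        "first novel of author X",
        "first novel of author X year",
        "publisher founded in city Y",
    ]
    print(redundancy_penalty(trajectory))
    print(shaped_reward(1.0, trajectory))
```

Under this sketch, a trajectory that keeps reissuing near-identical queries receives a lower reward than one that reaches the same answer through distinct queries, which is the behavior the abstract describes.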

Paper Structure

This paper contains 37 sections, 9 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Left: Adding the redundancy penalty reduces tool call counts during RL training. Middle: DeepDive drives the model's deep search ability with maximum tool calls, which improves performance on BrowseComp. Right: Multi-turn reinforcement learning consistently enhances DeepDive-32B on four deep search benchmarks.
  • Figure 2: An illustrative example of BrowseComp (Wei et al., 2025) questions, which often demand long-horizon reasoning and deep search integration across multiple blurry entities.
  • Figure 3: Overview of automated question–answer (QA) data synthesis from knowledge graphs (KGs) for DeepDive. Deep search QA pairs are automatically constructed by performing random walks over a knowledge graph and subsequently obfuscated using a large language model.
  • Figure 4: Overview of multi-turn RL in DeepDive.
  • Figure 5: Evaluation accuracy and tool calls during RL training on a random subset (BrowseComp-266).
  • ...and 4 more figures
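
The Figure 3 caption describes building QA pairs by random walks over a knowledge graph, followed by LLM-based obfuscation. As a rough illustration of the random-walk step only, the sketch below samples a multi-hop path from a toy graph. The adjacency-list representation, the `random_walk` helper, and the placeholder entities and relations are all assumptions made for this example; the LLM obfuscation step is not shown.

```python
import random

# Toy KG: node -> list of (relation, neighbor). Entities and relations are placeholders.
Graph = dict[str, list[tuple[str, str]]]

TOY_GRAPH: Graph = {
    "A": [("r1", "B"), ("r2", "C")],
    "B": [("r3", "D")],
    "C": [("r4", "D"), ("r5", "A")],
    "D": [("r6", "E")],
    "E": [],
}


def random_walk(graph: Graph, start: str, max_hops: int,
                rng: random.Random) -> list[tuple[str, str, str]]:
    """Sample a path of up to max_hops (head, relation, tail) triples without revisiting nodes."""
    path: list[tuple[str, str, str]] = []
    visited = {start}
    node = start
    for _ in range(max_hops):
        # Only consider edges that lead to a node not yet on the path.
        candidates = [(rel, nxt) for rel, nxt in graph.get(node, []) if nxt not in visited]
        if not candidates:
            break  # dead end: stop early with a shorter path
        rel, nxt = rng.choice(candidates)
        path.append((node, rel, nxt))
        visited.add(nxt)
        node = nxt
    return path


if __name__ == "__main__":
    walk = random_walk(TOY_GRAPH, "A", max_hops=3, rng=random.Random(0))
    for head, rel, tail in walk:
        print(f"{head} --{rel}--> {tail}")
    # The final tail could serve as the answer entity; the triples along the
    # path would then be rewritten into an obfuscated multi-hop question.
```

Longer walks give questions that need more hops to resolve, which matches the caption's emphasis on deep search QA pairs.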