W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents

Xiaoqiang Lin, Jun Hao Liew, Silvio Savarese, Junnan Li

TL;DR

The paper addresses an underexplored axis in scaling deep research agents: width, meaning multiple parallel tool calls issued within a single reasoning step, studied alongside depth. The Wide and Deep framework demonstrates that increasing per-turn tool calls can boost accuracy and cut the number of interaction turns on benchmarks like BrowseComp, while also reducing wall-clock time and API costs. Through empirical studies, the authors identify three drivers of improvement (broader source exploration, tool-output redundancy for verification, and effective query decomposition) and show that a Descending tool-call scheduler further enhances performance. The work highlights the potential of dynamic width-depth management for high-efficiency agents, acknowledges that current LLMs are limited in autonomously optimizing this trade-off, and suggests reinforcement learning as a promising future direction.

Abstract

Deep research agents have emerged as powerful tools for automating complex intellectual tasks through multi-step reasoning and web-based information seeking. While recent efforts have successfully enhanced these agents by scaling depth through increasing the number of sequential thinking and tool calls, the potential of scaling width via parallel tool calling remains largely unexplored. In this work, we propose the Wide and Deep research agent, a framework designed to investigate the behavior and performance of agents when scaling not only depth but also width via parallel tool calling. Unlike existing approaches that rely on complex multi-agent orchestration to parallelize workloads, our method leverages intrinsic parallel tool calling to facilitate effective coordination within a single reasoning step. We demonstrate that scaling width significantly improves performance on deep research benchmarks while reducing the number of turns required to obtain correct answers. Furthermore, we analyze the factors driving these improvements through case studies and explore various tool call schedulers to optimize parallel tool calling strategy. Our findings suggest that optimizing the trade-off between width and depth is a critical pathway toward high-efficiency deep research agents. Notably, without context management or other tricks, we obtain 62.2% accuracy with GPT-5-Medium on BrowseComp, surpassing the original 54.9% reported by GPT-5-High.
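
To make the idea of width concrete, the following is a minimal sketch of one agent turn with parallel tool calling: the model issues several tool calls in a single reasoning step, they run concurrently, and their outputs are returned to the trace together. The `search` function, the `run_turn` helper, and the trace layout are hypothetical names chosen for illustration and are not taken from the paper's implementation.

```python
import asyncio


async def search(query: str) -> str:
    """Hypothetical stand-in for a web search tool (assumption, not the paper's tool)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"results for {query!r}"


async def run_turn(tool_calls: list[str], trace: list[dict]) -> list[dict]:
    """Run every tool call issued in one reasoning step concurrently.

    All outputs are appended to the trace as a single entry, so the next
    reasoning step sees them together.
    """
    # asyncio.gather preserves the order of its inputs in the result list.
    outputs = await asyncio.gather(*(search(q) for q in tool_calls))
    trace.append(
        {
            "role": "tool",
            "outputs": [
                {"query": q, "output": o} for q, o in zip(tool_calls, outputs)
            ],
        }
    )
    return trace


if __name__ == "__main__":
    # Width 3: three queries issued in one step instead of three sequential turns.
    queries = ["query A", "query B", "query C"]
    final_trace = asyncio.run(run_turn(queries, []))
    for item in final_trace[0]["outputs"]:
        print(item["query"], "->", item["output"])
```

In this sketch, a width of three costs roughly one tool latency rather than three, which is one plausible reading of why wider turns can reduce the number of turns and wall-clock time.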

Paper Structure (14 sections, 3 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: (a) Single vs. parallel tool calling in a multi-step deep research agent trace. In parallel tool calling, the model performs a single reasoning step to issue multiple tool calls simultaneously; these calls are executed in parallel and their outputs are returned together into the agent's trace. (b) Top: Performance of different LLMs under parallel tool calling with varying # tool calls per step. Performance consistently improves as the # parallel tool calls increases across all models. Bottom: Average # turns required to complete the task with different # parallel tool calls. Increasing the # tool calls per iteration reduces the total # iterations needed to complete the deep research task.
  • Figure 2: Prompt for controlling the number of tool calls in parallel tool calling.
  • Figure 3: (Left) BrowseComp accuracy against average number of turns. (Right) Accuracy against number of tools per turn.
  • Figure 4: Scaling of tool calls. (Top row) Performance of GPT-5-Medium across different benchmarks. (Bottom row) Performance of different models on the BrowseComp benchmark.
  • Figure 5: Comparison of single vs. parallel tool calling in their use of information sources. Parallel tool calling draws on more reliable information sources.
  • ...and 6 more figures