W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents
Xiaoqiang Lin, Jun Hao Liew, Silvio Savarese, Junnan Li
TL;DR
The paper addresses an underexplored axis in scaling deep research agents: width, achieved through parallel tool calling within a single reasoning step, alongside the more commonly studied depth. The Wide and Deep framework shows that increasing the number of tool calls per turn can improve accuracy and reduce the number of interaction turns on benchmarks such as BrowseComp, while also lowering wall-clock time and API costs. Through empirical studies, the authors identify three drivers of improvement (broader source exploration, redundancy across tool outputs for verification, and effective query decomposition) and show that a Descending tool-call scheduler further improves performance. The work highlights the potential of dynamic width-depth management for high-efficiency agents, acknowledges that current LLMs are limited in autonomously optimizing this trade-off, and suggests reinforcement learning as a promising future direction.
Abstract
Deep research agents have emerged as powerful tools for automating complex intellectual tasks through multi-step reasoning and web-based information seeking. While recent efforts have successfully enhanced these agents by scaling depth, that is, by increasing the number of sequential thinking steps and tool calls, the potential of scaling width via parallel tool calling remains largely unexplored. In this work, we propose the Wide and Deep research agent, a framework designed to investigate the behavior and performance of agents when scaling not only depth but also width via parallel tool calling. Unlike existing approaches that rely on complex multi-agent orchestration to parallelize workloads, our method leverages intrinsic parallel tool calling to facilitate effective coordination within a single reasoning step. We demonstrate that scaling width significantly improves performance on deep research benchmarks while reducing the number of turns required to obtain correct answers. Furthermore, we analyze the factors driving these improvements through case studies and explore various tool call schedulers to optimize the parallel tool calling strategy. Our findings suggest that optimizing the trade-off between width and depth is a critical pathway toward high-efficiency deep research agents. Notably, without context management or other tricks, we obtain 62.2% accuracy with GPT-5-Medium on BrowseComp, surpassing the 54.9% originally reported for GPT-5-High.
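To make the notion of width concrete, the following minimal Python sketch issues several tool calls concurrently within one reasoning step. It is an illustration under stated assumptions, not the authors' implementation: `fake_search`, `run_turn`, and `descending_widths` are hypothetical names, the search tool is a stub that only sleeps to mimic latency, and the linear decay in `descending_widths` is merely one possible reading of a descending tool-call schedule, whose actual definition is not given here.

```python
import asyncio
import time


async def fake_search(query: str, latency: float = 0.05) -> str:
    """Hypothetical stand-in for a web-search tool; sleeps to mimic network latency."""
    await asyncio.sleep(latency)
    return f"results for {query!r}"


async def run_turn(queries: list[str]) -> list[str]:
    """Issue every tool call of one reasoning step concurrently.

    asyncio.gather returns the results in the same order as the inputs.
    """
    return await asyncio.gather(*(fake_search(q) for q in queries))


def descending_widths(start: int, turns: int, floor: int = 1) -> list[int]:
    """Assumed schedule: shrink the per-turn width by one each turn, never below floor."""
    return [max(floor, start - t) for t in range(turns)]


async def main() -> None:
    queries = ["query A", "query B", "query C", "query D"]

    # Width 4: all four calls share a single turn.
    t0 = time.perf_counter()
    wide = await run_turn(queries)
    wide_time = time.perf_counter() - t0

    # Width 1: the same calls spread over four sequential turns.
    t0 = time.perf_counter()
    deep = [(await run_turn([q]))[0] for q in queries]
    deep_time = time.perf_counter() - t0

    assert wide == deep  # same information, gathered in fewer turns
    print(f"width 4: {wide_time:.2f}s, width 1: {deep_time:.2f}s")
    print("per-turn widths:", descending_widths(start=4, turns=6))


if __name__ == "__main__":
    asyncio.run(main())
```

Because the stubbed calls overlap, the wide turn finishes in roughly the time of one call, whereas the sequential variant pays the latency once per call. In a real agent the model would also consume all tool outputs in one context update, which is where the reduction in interaction turns would come from.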
