Deep Research Bench: Evaluating AI Web Research Agents

FutureSearch: Nikos I. Bosse, Jon Evans, Robert G. Gambee, Daniel Hnyk, Peter Mühlbacher, Lawrence Phillips, Dan Schwarz, Jack Wildman

TL;DR

Deep Research Bench addresses the lack of time-stable evaluations for AI web research agents by using RetroSearch, a frozen web corpus, to benchmark 89 multi-step tasks across 8 categories. The authors evaluate a broad set of models (including thinking and non-thinking variants) and nine commercial web research products, using a ReAct-style agent loop and automated trace analysis to quantify hallucinations, tool use, and forgetting. They report that frontier models have made meaningful progress but fall short of human performance on the hardest tasks, with offline RetroSearch results largely mirroring live web performance. The work provides a public leaderboard and a scalable framework for ongoing evaluation as models and web content evolve, while acknowledging limitations and avenues for future improvements.

Abstract

Amongst the most common use cases of modern AI is LLM chat with web search enabled. However, no direct evaluations of the quality of web research agents exist that control for the continually changing web. We introduce Deep Research Bench, consisting of 89 multi-step web research task instances of varying difficulty across 8 diverse task categories, with the answers carefully worked out by skilled humans. We provide a "RetroSearch" environment with a large frozen set of scraped web pages, and demonstrate that offline "RetroSearch" agents perform comparably to "live web" agents, enabling reliable evaluations of models over time. We provide robust agent tooling and scaffolding to benchmark major LLMs as they are released, including "thinking" models like o3 and Gemini 2.5 Pro. We include automated evaluations of the lengthy agent traces to report progress over time in hallucinations, tool use, and forgetting. Finally, we evaluate the major web research products branded as "Deep Research", "Deep Search", "Search", or "Research." Results are available on a public leaderboard at https://drb.futuresearch.ai/.

Paper Structure

This paper contains 58 sections, 4 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Comparison of traditional ReAct approach and the modified approach for reasoning models.
  • Figure 2: System architecture of Deep Research Bench using RetroSearch. This shows the flow from task definition through the scraping pipeline that populates the RetroSearch database prior to running the benchmark, and then how agents use RetroSearch via an API at the time of task evaluation.
  • Figure 3: Scores across tasks and LLMs on the full set of 89 instances. Thinking models (which use the implicit-thought ReAct architecture) are colored in dark blue and non-thinking models (which use the regular explicit-thought ReAct architecture) in light blue.
  • Figure 4: Average scores for Live and Retro variants of the ReAct agents for each LLM.
  • Figure 5: Scores for Live and Retro variants of the ReAct agents. For each task, the upper histogram shows the score distribution across all instances and LLMs. The lower histogram shows the difference between the Live and Retro scores. We calculate the mean score across all repeats of an instance and LLM before taking the difference. This makes the difference insensitive to the order of the scores.
  • ...and 8 more figures
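
The Figure 5 caption describes averaging scores over repeats before taking the Live minus Retro difference. A minimal sketch of that calculation is given below. The record layout (instance, LLM, score), the function names, and the sample scores are all hypothetical and are not taken from the paper's code or results.

```python
from collections import defaultdict
from statistics import mean


def mean_by_key(records):
    """Average scores over repeats for each (instance, llm) pair.

    `records` is an iterable of (instance, llm, score) tuples; this
    layout is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for instance, llm, score in records:
        grouped[(instance, llm)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


def live_minus_retro(live_records, retro_records):
    """Difference of per-pair mean scores, Live minus Retro."""
    live = mean_by_key(live_records)
    retro = mean_by_key(retro_records)
    # Compare only pairs that were run in both variants.
    return {key: live[key] - retro[key] for key in live.keys() & retro.keys()}


# Hypothetical scores: two repeats of one instance for one LLM.
live = [("task-a", "model-x", 0.5), ("task-a", "model-x", 1.0)]
retro = [("task-a", "model-x", 1.0), ("task-a", "model-x", 0.0)]

diffs = live_minus_retro(live, retro)
# Reordering the repeats leaves the result unchanged.
diffs_reordered = live_minus_retro(live, list(reversed(retro)))
```

Because the means are taken first, the pairing of individual Live and Retro repeats never enters the calculation, which is why the difference does not depend on the order of the scores.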