DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin

TL;DR

This work presents DeepScholar-Bench, a live benchmark and automated evaluation framework for generative research synthesis. Its queries are drawn from recent ArXiv papers, and the task is to write a paper's related work section. The framework evaluates three dimensions: knowledge synthesis, retrieval quality, and verifiability. The work also introduces DeepScholar-base, an open-source reference pipeline built on LOTUS for live retrieval, synthesis, and grounding. Across a broad set of baselines, including open-source systems, search AIs, and commercial solutions, no system exceeds a score of 19% across all metrics, which leaves substantial room for improvement. DeepScholar-base provides a strong competitive baseline and up to 6.3x higher verifiability. The work offers a scalable, evergreen benchmark approach and a practical baseline to accelerate progress toward robust, verifiable generative research synthesis systems.

Abstract

The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work section of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than every other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.

Paper Structure

This paper contains 35 sections, 3 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Open-source systems, Search AIs, and DeepScholar-base, each using the Llama-4-Scout-17B-16E-Instruct model.
  • Figure 2: Search AIs with proprietary models (o3, Claude-opus-4, Gemini-2.5-pro), OpenAI DeepResearch, and DeepScholar-base (GPT4.1, Claude-opus-4).
  • Figure 4: DeepScholar-bench overview. To curate our dataset with real and challenging research tasks, we scrape recent, high-quality ArXiv papers from diverse domains and extract key attributes from each paper through an automated data pipeline that can easily be re-run. Our dataset task is to generate a related work section given information about a paper, such as its title and abstract. The DeepScholar-bench evaluation framework then holistically measures the performance of generated reports on three key dimensions: knowledge synthesis, retrieval quality, and verifiability.
  • Figure 5: Overview of DeepScholar-base. The system iteratively writes queries and performs web search before passing the search results through a series of semantic operators using the LOTUS system for LLM-based data processing, including a filtering step to discard irrelevant sources, a top-k ranking step to re-rank the most relevant sources, and a final aggregation step to generate the final report from all remaining sources.
  • Figure 6: DeepScholar-bench dataset schema.
  • ...and 4 more figures
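
The control flow described in the Figure 5 caption (search, filter, top-k ranking, aggregation) can be illustrated with a minimal, self-contained Python sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the in-memory `CORPUS` replaces live web search, the word-overlap `relevance` function replaces the LLM-based semantic operators, the queries are passed in instead of being written by a model, and none of the names correspond to the LOTUS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    title: str
    snippet: str


# Hypothetical in-memory corpus; stands in for a live web search backend.
CORPUS = [
    Source("Semantic operators for LLM data processing",
           "declarative filter, rank and aggregate over text"),
    Source("Benchmarks for long-form question answering",
           "evaluating cited long-form answers"),
    Source("A recipe for sourdough bread", "flour, water, salt"),
]


def tokens(text):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def search(query, corpus):
    """Stub search: sources sharing at least one token with the query."""
    q = tokens(query)
    return [s for s in corpus if q & tokens(s.title + " " + s.snippet)]


def relevance(topic, source):
    """Stub for an LLM relevance judgment: count of shared tokens."""
    return len(tokens(topic) & tokens(source.title + " " + source.snippet))


def run_pipeline(topic, queries, corpus, k=2, min_overlap=2):
    # 1. Search once per query and deduplicate results by title.
    found = {}
    for query in queries:
        for source in search(query, corpus):
            found[source.title] = source
    # 2. Filter: discard sources judged irrelevant to the topic.
    kept = [s for s in found.values() if relevance(topic, s) >= min_overlap]
    # 3. Top-k: keep the k most relevant sources, best first.
    ranked = sorted(kept, key=lambda s: relevance(topic, s), reverse=True)[:k]
    # 4. Aggregate: build a report with one numbered citation per source.
    report = "\n".join(
        f"[{i}] {s.title}: {s.snippet}" for i, s in enumerate(ranked, 1)
    )
    return ranked, report
```

Calling `run_pipeline("evaluating cited long-form answers with LLM data processing", ["long-form answers", "LLM data processing", "bread"], CORPUS)` returns the two on-topic sources in ranked order and a report that cites each one; the off-topic source is retrieved by the third query but dropped by the filter step.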