ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu

TL;DR

This paper presents ResearchRubrics, a comprehensive rubric-based benchmark for evaluating Deep Research agents across diverse domains; it formalizes a tri-axial task complexity framework and pairs 101 prompts with 2,593 expert-authored criteria evaluated via LLM judges. It documents a thorough data-collection pipeline and human-guided rubric design, and demonstrates that leading DR systems fall short of rigorous rubric adherence, highlighting gaps in implicit reasoning and multi-source synthesis. The findings argue for architectural advances rather than incremental prompt engineering and provide open resources to accelerate progress toward trustworthy deep research assistants. Overall, ResearchRubrics offers a scalable, human-aligned framework to systematically diagnose and drive improvements in long-form, evidence-backed research agents.

Abstract

Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini's DR and OpenAI's DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities. To that end, we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.

Paper Structure

This paper contains 40 sections, 3 equations, 21 figures, 10 tables.

Figures (21)

  • Figure 1: Overview of ResearchRubrics and its evaluation pipeline.
  • Figure 2: The three-stage pipeline for creating and refining prompts and rubrics. An initial draft by Expert 1 is iteratively improved with Expert 2 before a final review and adjustment by Expert 3.
  • Figure 3: Distribution of task domains in our collected data.
  • Figure 4: Overview of task complexity dimensions and rubric criteria category distributions in ResearchRubrics.
  • Figure 5: Rubric-axis failure rates across Deep Research agents. Dark bars represent ternary grading; light bars show binary grading. Implicit reasoning and synthesis show markedly higher failure rates compared to communication quality and references. The pattern holds across all three systems, indicating architectural rather than implementation limitations.
  • ...and 16 more figures
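
To make the notion of rubric adherence under ternary versus binary grading (Figure 5) more concrete, the following is a minimal sketch and not the paper's released evaluation code. It assumes that each criterion carries a positive weight, that a judge returns one of three verdicts, that partial satisfaction earns half credit under ternary grading, and that it earns no credit under binary grading. The names `Judgement`, `TERNARY_CREDIT`, `BINARY_CREDIT`, and `compliance`, as well as the example criteria, are hypothetical.

```python
from dataclasses import dataclass

# Assumed credit per verdict; the 0.5 for "partial" is an illustrative choice,
# and collapsing "partial" to 0.0 under binary grading is likewise an assumption.
TERNARY_CREDIT = {"satisfied": 1.0, "partial": 0.5, "not_satisfied": 0.0}
BINARY_CREDIT = {"satisfied": 1.0, "partial": 0.0, "not_satisfied": 0.0}


@dataclass
class Judgement:
    """One judge verdict for one rubric criterion (hypothetical structure)."""
    criterion: str
    weight: float  # assumed positive importance weight
    verdict: str   # "satisfied", "partial", or "not_satisfied"


def compliance(judgements, credit):
    """Weighted fraction of rubric credit earned by a single response."""
    total = sum(j.weight for j in judgements)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(j.weight * credit[j.verdict] for j in judgements)
    return earned / total


# Made-up example rubric for one response.
example = [
    Judgement("cites primary sources", 2.0, "satisfied"),
    Judgement("addresses implicit constraint", 1.0, "partial"),
    Judgement("synthesizes across documents", 1.0, "not_satisfied"),
]

ternary_score = compliance(example, TERNARY_CREDIT)
binary_score = compliance(example, BINARY_CREDIT)
print(f"ternary={ternary_score:.3f} binary={binary_score:.3f}")
```

Under these assumptions, binary grading can only lower or preserve a response's score relative to ternary grading, because partial credit is the only quantity that changes between the two modes.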