DRBench: A Realistic Benchmark for Enterprise Deep Research

Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh, Étienne Marcotte, Xing Han Lù, Nicolas Chapados, Spandana Gella, Christopher Pal, Alexandre Drouin, Issam H. Laradji

TL;DR

DRBench tackles enterprise deep research by benchmarking autonomous agents that synthesize insights from both public web sources and private company data. The benchmark couples realistic personas with a self-hosted, multi-application environment to evaluate multi-step, long-horizon research tasks, measuring Insight Recall, Factuality, and Report Quality. Across backbone models and planning strategies, results show robust distractor avoidance but imperfect extraction of decision-critical insights, with adaptive planning and larger models offering the strongest gains. These findings highlight key directions for improving enterprise deep research agents, including better integration of private data, adaptive planning, and reliable grounding to sourced evidence.

Abstract

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and a private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.
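
To make the recall criterion concrete, here is a minimal sketch of how an insight-recall score could be computed once each injected insight has been judged as found or not found in a report. It is an illustration under stated assumptions, not the DRBench implementation: the function name `insight_recall`, the set-based matching, and the insight identifiers are all hypothetical, and the step that decides whether a report actually contains an insight is left out.

```python
def insight_recall(injected: set[str], detected: set[str]) -> float:
    """Fraction of injected ground-truth insights that a report recovered.

    Assumption: both arguments are sets of insight identifiers, and some
    upstream judging step has already decided which insights the report
    contains. Identifiers in `detected` that were never injected (for
    example, distractors) do not raise the score.
    """
    if not injected:
        raise ValueError("at least one injected insight is required")
    return len(injected & detected) / len(injected)


# Hypothetical identifiers, for illustration only.
injected = {"insight-1", "insight-2", "insight-3", "insight-4"}
detected = {"insight-1", "insight-3", "distractor-7"}

recall = insight_recall(injected, detected)
print(f"{recall:.2f}")  # 0.50
```

Under this simplification, a report that surfaces two of four injected insights scores 0.50 regardless of how many distractors it also mentions; penalizing distractors would need a separate measure.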

Paper Structure

This paper contains 67 sections, 42 figures, and 26 tables.

Figures (42)

  • Figure 1: DRBench pipeline. The Task Context defines the deep research question grounded by the company and persona given to the agent. Task Data, including both distractor and injected groundtruth insights in different formats (PDFs, DOCX, PPTX, XLSX, chats, etc.), are loaded into the enterprise environment's applications. The DRBench Agent accesses both public web sources and local enterprise data to extract relevant insights for the research question. It produces a structured research report, which is evaluated for Insight Recall (detecting injected groundtruth insights), Factuality (verifying claims are correctly cited), and Report Quality.
  • Figure 2: DRBench Task Generation Pipeline. The pipeline comprises five main stages, during each of which LLMs generate candidate data such as company context, insights, and research questions, while human annotators verify quality and select the final version. Stages S1–S5 denote the five generation steps.
  • Figure 3: DRBench Agent architecture showing the enterprise research workflow from question submission through iterative research cycles to final report generation, using both enterprise and web search capabilities.
  • Figure 4: t-SNE visualization of QA pairs for the DR Question in Task DR0005. The plot shows the distribution of annotated pairs across Supporting Insights (green), Distractors (red), and the central Deep Research (DR) Question (gold star). Out of 49 pairs, 16 correspond to supporting insights and 33 are distractors. The visualization illustrates how relevant insights cluster separately from distractors, highlighting the challenge of retrieving salient information in a distractor-heavy environment.
  • Figure 5: Example files with injected insights in DRBench.
  • ...and 37 more figures