DRACO: a Cross-Domain Benchmark for Deep Research Accuracy, Completeness, and Objectivity

Joey Zhong; Hao Zhang; Clare Southern; Jeremy Yang; Thomas Wang; Kate Jung; Shu Zhang; Denis Yarats; Johnny Ho; Jerry Ma

TL;DR

DRACO introduces a cross-domain benchmark for deep research, built from real production queries to evaluate agentic AI across 100 complex tasks in 10 domains and 40 countries. It combines a rigorous, expert-designed rubric with an open-source LLM-as-a-judge framework to assess factual accuracy, breadth/depth, presentation, and citations, enabling objective, bounded, and challenging evaluation. The paper presents a detailed five-stage task-construction pipeline (sampling, preprocessing, augmentation, filtering, curation) and a multi-stage rubric design process involving domain experts, saturation testing, and final reviews. Experimental results show Perplexity Deep Research leading across domains and axes, with notable gaps in factual accuracy and domain-specific performance, while revealing trade-offs in token usage and latency. The discussion outlines generalization limits (single-turn, static tasks, English-only) and proposes future directions including multi-turn, multimodal tasks, expanded domains/languages, and component-level analysis to improve the measurement of deep research systems’ capabilities and reliability.

Abstract

We present DRACO (Deep Research Accuracy, Completeness, and Objectivity), a benchmark of complex deep research tasks. These tasks, which span 10 domains and draw on information sources from 40 countries, originate from anonymized real-world usage patterns within a large-scale deep research system. Tasks are sampled from a de-identified dataset of Perplexity Deep Research requests, then filtered and augmented to ensure that the tasks are anonymized, open-ended and complex, objectively evaluable, and representative of the broad scope of real-world deep research use cases. Outputs are graded against task-specific rubrics along four dimensions: factual accuracy (accuracy), breadth and depth of analysis (including completeness), presentation quality (including objectivity), and citation quality. DRACO is publicly available at https://hf.co/datasets/perplexity-ai/draco.
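As a rough illustration of how rubric-based grading along the four dimensions might be aggregated, the sketch below computes a per-axis score from binary verdicts. Everything in it is an assumption made for illustration: the `Criterion` class, the axis identifiers, the positive weights, and the weighted-fraction formula are hypothetical and are not taken from the DRACO rubric or its judge framework.

```python
from collections import defaultdict
from dataclasses import dataclass

# Axis names paraphrase the four dimensions in the abstract; the identifiers are hypothetical.
AXES = ("factual_accuracy", "breadth_depth", "presentation", "citation")


@dataclass(frozen=True)
class Criterion:
    axis: str          # one of AXES
    description: str   # what the grader checks for
    weight: float      # assumed positive importance weight


def score_by_axis(criteria, verdicts):
    """Return the weighted fraction of satisfied criteria for each axis.

    verdicts[i] is True when a grader (for example an LLM judge) decides
    that criteria[i] is met. Axes with no criteria are omitted.
    """
    if len(criteria) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    earned = defaultdict(float)
    possible = defaultdict(float)
    for criterion, met in zip(criteria, verdicts):
        if criterion.axis not in AXES:
            raise ValueError(f"unknown axis: {criterion.axis}")
        possible[criterion.axis] += criterion.weight
        if met:
            earned[criterion.axis] += criterion.weight
    return {a: earned[a] / possible[a] for a in AXES if possible[a] > 0}


# Made-up rubric for a single task, with made-up verdicts.
rubric = [
    Criterion("factual_accuracy", "States the correct reporting year", 2.0),
    Criterion("factual_accuracy", "Names the correct regulator", 1.0),
    Criterion("citation", "Cites a primary source", 1.0),
]
result = score_by_axis(rubric, [True, False, True])
print(result)
```

Reporting each axis separately, rather than collapsing to a single number, mirrors the abstract's description of grading along four dimensions; how the benchmark itself combines criteria is not specified in this section.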
