ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics
Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar
TL;DR
ResearchQA presents a large-scale, survey-based benchmark for evaluating scholarly QA by mining academic surveys into queries and rubric-based evaluation criteria across 75 fields. It details a multi-stage pipeline, from venue selection and survey article retrieval to query and rubric generation, and provides 21.4K queries and 160K rubric items. Expert validation with Ph.D.-level annotators indicates that the queries capture researchers’ information needs and that the rubrics cover meaningful evaluation criteria. The paper also introduces an evaluation protocol and reports results from 18 systems, revealing limitations in current systems’ rubric coverage and highlighting the role of retrieval, rubric design, and domain variation in performance. By releasing the dataset and methodology, ResearchQA enables broader, more principled multi-field evaluation of research-synthesis capabilities.
Abstract
Evaluating long-form responses to research queries relies heavily on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet research expertise is abundant: survey articles consolidate knowledge spread across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Queries and rubrics are jointly derived from survey sections, where rubric items list query-specific answer evaluation criteria, e.g., citing papers, making explanations, and describing limitations. Thirty-one Ph.D. annotators in 8 fields judge that 90% of queries reflect Ph.D. information needs and 87% of rubric items warrant emphasis of a sentence or longer. We leverage ResearchQA to evaluate 18 systems in 7.6K head-to-head comparisons. No parametric or retrieval-augmented system we evaluate exceeds 70% coverage of rubric items, and the highest-ranking system overall shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
