ResearchQA: Evaluating Scholarly Question Answering at Scale Across 75 Fields with Survey-Mined Questions and Rubrics

Li S. Yifei, Allen Chang, Chaitanya Malaviya, Mark Yatskar

TL;DR

ResearchQA is a large-scale benchmark for evaluating scholarly QA, built by mining academic surveys into queries and rubric-based evaluation criteria across 75 fields. It details a multi-stage pipeline (venue selection, survey article retrieval, then query and rubric generation) and provides 21.4K queries and 160K rubric items. Expert validation with Ph.D.-level annotators indicates that queries capture researchers’ information needs and that rubrics cover meaningful evaluation criteria. The paper also introduces an evaluation protocol and reports results from 18 systems, which reveal limited rubric coverage in current systems and highlight the role of retrieval, rubric design, and domain variation in performance. By releasing the dataset and methodology, ResearchQA enables broader, more principled multi-field evaluation of research-synthesis capabilities.

Abstract

Evaluating long-form responses to research queries relies heavily on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet research expertise is abundant: survey articles consolidate knowledge spread across the literature. We introduce ResearchQA, a resource for evaluating LLM systems by distilling survey articles from 75 research fields into 21K queries and 160K rubric items. Queries and rubrics are jointly derived from survey sections, where rubric items list query-specific answer evaluation criteria, i.e., citing papers, making explanations, and describing limitations. Thirty-one Ph.D. annotators in 8 fields judge that 90% of queries reflect Ph.D. information needs and 87% of rubric items warrant emphasis of a sentence or longer. We leverage ResearchQA to evaluate 18 systems in 7.6K head-to-heads. No parametric or retrieval-augmented system we evaluate exceeds 70% on covering rubric items, and the highest-ranking system overall shows 75% coverage. Error analysis reveals that the highest-ranking system fully addresses less than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
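
To make the coverage figures above concrete, here is a minimal sketch of how a rubric-coverage percentage and a per-type "fully addressed" rate could be computed for a single answer. It is illustrative only and not the paper's implementation: the three-level scale (`none`, `partial`, `full`), the 0.5 weight for partial credit, and the function names `coverage_percent` and `fully_addressed_rate` are all assumptions, since the abstract only states that items can be "fully" addressed.

```python
from collections import defaultdict

# Assumed three-level scale; only "fully addresses" appears in the abstract.
# The "partial" level and its 0.5 weight are hypothetical.
SCORE = {"none": 0.0, "partial": 0.5, "full": 1.0}


def coverage_percent(judgments):
    """Mean rubric-item score for one answer, as a percentage."""
    if not judgments:
        raise ValueError("a rubric needs at least one judged item")
    return 100.0 * sum(SCORE[level] for _, level in judgments) / len(judgments)


def fully_addressed_rate(judgments):
    """Percentage of items judged 'full', grouped by item type."""
    totals, full = defaultdict(int), defaultdict(int)
    for item_type, level in judgments:
        totals[item_type] += 1
        full[item_type] += level == "full"  # True counts as 1
    return {t: 100.0 * full[t] / totals[t] for t in totals}


# Hypothetical judgments for one answer: (item type, judged level).
example = [
    ("citation", "none"),
    ("citation", "partial"),
    ("limitation", "full"),
    ("comparison", "full"),
]

print(coverage_percent(example))
print(fully_addressed_rate(example))
```

Separating the two functions mirrors the distinction the abstract draws between overall coverage and the stricter per-type rate of fully addressed items; an answer can score moderately on the first while scoring near zero on citation items in the second.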

Paper Structure

This paper contains 90 sections, 1 equation, 15 figures, 13 tables.

Figures (15)

  • Figure 1: An example ResearchQA query and evaluation rubric. The query, mined from zhou-etal-2024, instructs a research system to generate a long-form answer. An automatic evaluator creates an absolute measure of answer quality via a rubric with up to 8 items. The first rubric item cites razeghi2022termfreq.
  • Figure 2: (Left) ResearchQA generation stages: We identify top-20 venues from each field in Google Scholar, retrieve survey articles from available databases, and generate queries and rubrics from survey sections. Throughout generation, we employ appropriate filtering mechanisms to ensure data quality. (Right) ResearchQA test split field distribution: Queries in the test split span 75 research fields from 7 domains, with high representation in Health Sciences & Medicine, Life & Earth Sciences, and Engineering.
  • Figure 3: ResearchQA query and rubric quality ratings by 31 Ph.D.-level experts.
  • Figure 4: A comparison of how much rubrics can aid different evaluators in making predictions that agree with plurality human labels (y-axis) as a function of rubric size (x-axis). All direct judges benefit from integration of rubrics through the hybrid judge, substantially reducing their disagreement with human experts.
  • Figure 5: Rubrics are recent, mostly originating from surveys published in the past decade. When cited sources violate date cutoff years, Coverage % shows little bias.
  • ...and 10 more figures