PeerQA: A Scientific Question Answering Dataset from Peer Reviews

Tim Baumgärtner; Ted Briscoe; Iryna Gurevych

TL;DR

PeerQA introduces a real-world, document-level QA dataset for science by sourcing questions from peer reviews and collecting author-provided answers. The dataset supports three practical tasks (evidence retrieval, answerability classification, and free-form answer generation) and includes 579 labeled QA pairs from 208 papers plus 12k unlabeled questions. Analyses show that decontextualization improves retrieval at the paragraph level and that long papers (about 12k tokens on average) pose challenges for generation, though retrieval-augmented generation with top passages often outperforms full-document context. The work provides baselines, a detailed analysis, and open-source code and data to drive future research in long-context scientific QA.

Abstract

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset from other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data are available at https://github.com/UKPLab/peerqa.
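
As a minimal sketch of what a simple decontextualization step could look like, the snippet below prepends the paper title to a paragraph before scoring it against a question. Prepending the title is an assumption made for illustration and is not necessarily the approach used in the paper. The helper names (`decontextualize`, `overlap_score`) and the toy token-overlap scorer are hypothetical stand-ins for a real retriever.

```python
from __future__ import annotations

import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def decontextualize(title: str, paragraph: str) -> str:
    """Prepend the paper title so the paragraph carries document-level context.

    Assumption: title prepending is only one possible simple strategy.
    """
    return f"{title}: {paragraph}"


def overlap_score(question: str, passage: str) -> float:
    """Toy relevance score: fraction of question tokens found in the passage."""
    q_tokens = tokenize(question)
    if not q_tokens:
        return 0.0
    return len(q_tokens & tokenize(passage)) / len(q_tokens)


# Illustrative inputs; the question is made up for this example.
title = "PeerQA: A Scientific Question Answering Dataset from Peer Reviews"
paragraph = "Answers have been annotated by the original authors of each paper."
question = "Who annotated the answers in PeerQA?"

plain = overlap_score(question, paragraph)
contextual = overlap_score(question, decontextualize(title, paragraph))
print(f"plain={plain:.2f} decontextualized={contextual:.2f}")
```

In this toy case the question names the dataset but the paragraph does not, so the bare paragraph misses a matching term that the title supplies. A real system would apply the same preprocessing to every passage before indexing it with a sparse or dense retriever.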

Paper Structure (78 sections, 14 figures, 16 tables)

Introduction
Related Work
Peer Review
Scientific QA
Long-Context QA
PeerQA
Data Collection
Paper Processing
Question Processing
Answer Annotation
Quality Control
Analysis
Experiments
Answer Evidence Retrieval
Answerability and Answer Generation
...and 63 more sections

Figures (14)

Figure 1: Overview of the PeerQA data collection process. From the peer review process (in green), we extract and process questions from the reviews. Given the published version of the article and a question, an expert (in our case, the original paper authors) (1) checks the question and modifies or discards it, (2) annotates whether it is answerable or not (i.e. if there is sufficient information in the paper), and if so (3) highlights the evidence to answer the question and finally (4) provides a free-form answer to the question.
Figure 2: Statistics of the PeerQA dataset. The color coding shows the distribution per venue and by the scientific community (i.e., blue colors for ML, orange for NLP, green for Geosciences, and purple for mixed). The gray dotted line indicates the average. The leftmost histogram shows a paper distribution, while the others show a distribution of questions. We measure the number of tokens using the Llama-3 tokenizer.
Figure 3: Answerability scores (y-axis) with different contexts (x-axis). In the Gold setting, the model is only provided with the annotated, relevant paragraphs (i.e., no unanswerable questions are available in this setting); in Full Text, the entire paper is provided in the context (and potentially truncated); otherwise, the top-scoring passages by SPLADEv3 are provided. The Precision and Recall plots show the Answerable (- -) and Unanswerable ($\cdot\cdot$) classes.
Figure 4: Rouge-L F1, AlignScore, and Prometheus Correctness metrics between the generated answer and the annotated free-form answer (first column), the GPT-4 augmented answer (second column), and the annotated evidence passages (third column).
Figure 5: Empirical cumulative distribution function of the cosine similarity between the processed question and the sentences in the review. Context n refers to the n-th preceding sentence before the raw, unprocessed Review Question. Max. Similarity takes the maximum over these four similarity scores, i.e., it reports the similarity to the sentence that the processed question is most similar to.
...and 9 more figures
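
The Max. Similarity quantity described in the Figure 5 caption can be sketched as follows. This is a rough illustration only: it uses bag-of-words count vectors as a stand-in for whatever sentence representation the paper uses (an assumption), and the function names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words count vector (stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def max_similarity(processed_question, review_sentences):
    """Score the processed question against each review sentence.

    Returns the maximum score together with the per-sentence scores.
    """
    q = bow(processed_question)
    scores = [cosine(q, bow(s)) for s in review_sentences]
    return max(scores, default=0.0), scores


# Hypothetical review excerpt: the raw question followed by its three
# preceding sentences (Context 1 to Context 3).
review_sentences = [
    "Why was this baseline chosen?",
    "The experiments compare against a single retrieval baseline.",
    "The evaluation section is short.",
    "The paper is clearly written.",
]
processed = "Why was a single retrieval baseline chosen for the experiments?"

best, per_sentence = max_similarity(processed, review_sentences)
print(best, per_sentence)
```

Taking the maximum rather than only the score against the raw question accounts for processed questions that absorbed wording from the sentences preceding the question in the review.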
