SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation

Yuwei Wan, Yixuan Liu, Aswathy Ajith, Clara Grazian, Bram Hoex, Wenjie Zhang, Chunyu Kit, Tong Xie, Ian Foster

TL;DR

SciQAG presents a two-component framework that automatically generates open-ended science QA pairs from scholarly articles and filters them with RACAR-based evaluation. It builds SciQAG-24D from 22,743 papers across 24 domains, totaling 188,042 QA pairs, and demonstrates that fine-tuning LLMs on this dataset improves performance on open-ended science QA tasks and related scientific benchmarks. Zero-shot experiments show strong performance for commercial models, while fine-tuning a lightweight open-source model yields substantial gains on both the SciQAG-24D test set and downstream scientific tasks. The work provides public datasets, models, and evaluation tools to advance science QA research and encourages broader domain coverage and robust evaluation of reasoning in AI systems.

Abstract

We introduce SciQAG, a novel framework for automatically generating high-quality science question-answer pairs from a large corpus of scientific literature based on large language models (LLMs). SciQAG consists of a QA generator and a QA evaluator, which work together to extract diverse and research-level questions and answers from scientific papers. Utilizing this framework, we construct a large-scale, high-quality, open-ended science QA dataset containing 188,042 QA pairs extracted from 22,743 scientific papers across 24 scientific domains. We also introduce SciQAG-24D, a new benchmark task designed to evaluate the science question-answering ability of LLMs. Extensive experiments demonstrate that fine-tuning LLMs on the SciQAG dataset significantly improves their performance on both open-ended question answering and scientific tasks. To foster research and collaboration, we make the datasets, models, and evaluation codes publicly available, contributing to the advancement of science question answering and developing more interpretable and reasoning-capable AI systems.

Paper Structure (31 sections, 2 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: The SciQAG framework for generating science QA from the scientific literature. The dashed line represents optional fine-tuning.
  • Figure 2: Spearman and Pearson correlations between GPT-4 assigned scores and expert-annotated scores.
  • Figure 3: Proportion of papers of different categories in the training and testing of the SciQAG-24D dataset.
  • Figure 4: Pairwise similarities between pairs of 10 questions generated from ivleva2009towards. Lower similarity is indicated by bluer cells, while higher similarity is indicated by redder cells. Scores above 0.7 are marked with red dots.
  • Figure 5: Original distribution of papers from the WoS Core Collection across 24 WoS categories selected from Chemistry, Physics, Materials Science and Energy.
  • ...and 1 more figure