SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

David Wadden, Kejian Shi, Jacob Morrison, Alan Li, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, Arman Cohan

TL;DR

SciRIFF introduces a large, expert-crafted instruction-following resource for scientific literature, unifying 54 tasks across five domains with long-context inputs and structured outputs. It pairs the dataset with SciRIFF-Eval, a 4.1K held-out benchmark for measuring out-of-distribution generalization, and shows that finetuning LLMs on SciRIFF yields a 70.6% average improvement over baselines trained only on general-domain instructions, with meaningful gains on information extraction and evidence-grounding tasks. The work demonstrates the importance of expert templates and grounded evaluation for science-focused instruction-following, while also revealing limitations in summarization and the need to balance data sources carefully. The authors release datasets, evaluation suites, model checkpoints, and code to enable reproducible research and practical tooling for researchers navigating the expanding scientific literature.

Abstract

We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF is unique in being an entirely expert-written, high-quality instruction-following dataset for extracting and synthesizing information from research literature across diverse scientific fields. It features complex instructions with long input contexts, detailed task descriptions, and structured outputs. To demonstrate its utility, we finetune a series of large language models (LLMs) using a mix of general-domain and SciRIFF instructions. On nine out-of-distribution held-out tasks (referred to as SciRIFF-Eval), LLMs finetuned on SciRIFF achieve a 70.6% average improvement over baselines trained only on general-domain instructions. SciRIFF facilitates the development and evaluation of LLMs to help researchers navigate the rapidly growing body of scientific literature.

Paper Structure (41 sections, 13 figures, 8 tables)

Figures (13)

  • Figure 1: Example SciRIFF tasks. Given an input context from a research paper, the text prompt instructs an LLM to perform an operation on the input, e.g. determine whether the abstract entails a scientific claim, extract information over the full_text, answer a question, etc. The model's output must conform to a task-specific, user-specified structure. SciRIFF unifies 54 scientific literature understanding tasks under a common input / output format, enabling the development of LLMs that can flexibly generalize to novel scientific use cases.
  • Figure 2: SciRIFF: pie charts show dataset counts and brackets indicate instance totals for task categories/domains.
  • Figure 3: Performance on SciRIFF-Eval vs. $n_{sci}$ (instances/task). Gains saturate at $n_{sci}=1000$ (see the section on training settings).
  • Figure 4: Overview of the SciRIFF dataset. Dashed black lines indicate that a task is included in SciRIFF-Eval and held out during model training. Scientific domains are color-coded: Biomedicine; AI; Clinical Medicine; Chemistry; Materials Science; Miscellaneous.
  • Figure 5: Distribution of input (left) and output (right) token lengths over SciRIFF training instances.
  • ...and 8 more figures
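
The common input / output format described in the Figure 1 caption can be illustrated with a minimal sketch: a prompt is assembled from task instructions plus a paper context, and the model's response is checked against the structure the prompt requested. The function names, the JSON fields ("verdict", "evidence"), and the label set below are hypothetical and chosen only for illustration; they are not taken from the SciRIFF release, whose actual schema may differ.

```python
import json

# Illustrative label set for a claim-verification style task (an assumption,
# not the labels used by any specific SciRIFF task).
ALLOWED_VERDICTS = {"SUPPORT", "CONTRADICT", "NEI"}


def build_prompt(instructions: str, context: str) -> str:
    """Join task instructions and paper context into a single text prompt."""
    return f"{instructions.strip()}\n\n{context.strip()}"


def parse_verdict(raw_output: str) -> dict:
    """Parse a model response and check it against the requested structure.

    Assumed structure: a JSON object with a "verdict" string drawn from
    ALLOWED_VERDICTS and an "evidence" list of strings.
    """
    parsed = json.loads(raw_output)  # raises a ValueError subclass on bad JSON
    if not isinstance(parsed, dict):
        raise ValueError("output must be a JSON object")
    verdict = parsed.get("verdict")
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(ALLOWED_VERDICTS)}")
    evidence = parsed.get("evidence")
    if not isinstance(evidence, list) or not all(isinstance(s, str) for s in evidence):
        raise ValueError("evidence must be a list of strings")
    return parsed


# Example usage with placeholder text standing in for a real claim and abstract.
prompt = build_prompt(
    "Decide whether the abstract supports the claim. "
    'Respond with a JSON object with keys "verdict" and "evidence".',
    "Claim: <claim text>\nAbstract: <abstract text>",
)
result = parse_verdict('{"verdict": "SUPPORT", "evidence": ["<quoted sentence>"]}')
```

Checking the structure before scoring keeps malformed responses from being silently counted as correct or incorrect; how SciRIFF-Eval itself handles malformed outputs is not stated in this section.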