OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Akari Asai, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi

TL;DR

OpenScholar introduces a specialized retrieval-augmented language model designed to synthesize scientific literature with citation-backed outputs by retrieving passages from a large open-access datastore and iteratively refining answers via self-feedback. It is evaluated against a new large-scale benchmark, ScholarQABench, which emphasizes realistic, multi-paper literature reviews across four domains and combines automated metrics with expert human judgments. Results show OpenScholar-8B and OpenScholar-GPT4o achieve state-of-the-art performance, outperforming GPT-4o and proprietary systems on correctness and citation accuracy while offering substantial cost advantages. The work provides open-source code, models, and data to enable reproducible evaluation and further development in scientific literature synthesis.

Abstract

Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, and data, along with a public demo.

Paper Structure

This paper contains 97 sections, 1 equation, 20 figures, 23 tables.

Figures (20)

  • Figure 1: (Top) Overview of OpenScholar: OpenScholar consists of a specialized datastore, retrievers, and LMs, and iteratively improves responses using self-feedback inference with retrieval. (Middle) Overview of ScholarQABench: ScholarQABench consists of 2.2k expert-written questions across multiple scientific disciplines, and we introduce automatic and human evaluation protocols for it. (Bottom) Automatic and Human Evaluation Results: Experimental results show the effectiveness of ScholarQABench, and that OpenScholar with our trained 8B model or GPT4o significantly outperforms other systems and is preferred over experts more than 50% of the time in human evaluations.
  • Figure 2: Detailed overview of OpenScholar inference (top) and training (bottom). At inference time, given an input $x$, OpenScholar first uses a retriever to identify relevant papers from a specialized datastore (OpenScholar-Datastore), and then uses a reranker to refine the results and select the top $N$ retrieved documents. The retrieved passages are then passed to the LM, which (1) generates an initial response $y_0$, (2) generates self-feedback $f_1$ on that initial output, and (3) incorporates the feedback ($f_i$) to generate an updated response $y_1$. The LM iteratively refines its output a pre-defined number of times, repeating the process until all feedback is incorporated. To train a smaller yet competitive 8B LM, we generate high-quality training data using this inference-time pipeline, followed by data filtering and mixing.
  • Figure 3: A ScholarQA-CS example and evaluation overview. ScholarQA-CS consists of 100 questions and an average of 4.4 expert-written rubrics to be satisfied. Our ScholarQABench evaluation pipeline evaluates aspects such as correctness and citation accuracy.
  • Figure 4: Analysis on OpenScholar: (a) Ablation studies for key components of OpenScholar training and inference based on different underlying LMs. (b) Top N docs: Analysis of the effect of varying the number of context chunks for final downstream tasks. We evaluate final model performance based on citation accuracy and correctness on multi-doc QA tasks, using OpenScholar 8B and Llama 3.1 8B.
  • Figure 5: Fine-grained evaluation results. Score distributions between (1) GPT4o (top), (2) OpenScholar with 8B (middle), and (3) OpenScholar with GPT4o with Human (bottom).
  • ...and 15 more figures
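
The inference loop described in the Figure 2 caption (retrieve, rerank to the top N, draft an initial response, then fold in self-feedback a bounded number of times) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: every name here (Passage, retrieve, rerank, self_feedback_inference, and the generate, give_feedback, and refine callables) is hypothetical, the word-overlap scoring is a toy stand-in for the trained retriever and reranker, and any additional retrieval during refinement is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Passage:
    doc_id: str
    text: str


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(query: str, datastore: List[Passage], k: int) -> List[Passage]:
    """Toy first-stage retriever: rank passages by shared-word count."""
    q = _tokens(query)
    ranked = sorted(datastore, key=lambda p: len(q & _tokens(p.text)), reverse=True)
    return ranked[:k]


def rerank(query: str, passages: List[Passage], top_n: int) -> List[Passage]:
    """Toy reranker: Jaccard similarity between query and passage words."""
    q = _tokens(query)

    def score(p: Passage) -> float:
        t = _tokens(p.text)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(passages, key=score, reverse=True)[:top_n]


def self_feedback_inference(
    x: str,
    datastore: List[Passage],
    generate: Callable[[str, List[Passage]], str],
    give_feedback: Callable[[str, str, List[Passage]], List[str]],
    refine: Callable[[str, str, str, List[Passage]], str],
    k: int = 10,
    top_n: int = 3,
    max_iters: int = 3,
) -> Tuple[str, List[Passage]]:
    """Retrieve, rerank, draft y_0, then apply feedback items one at a time."""
    passages = rerank(x, retrieve(x, datastore, k), top_n)
    y = generate(x, passages)                # initial response y_0
    feedback = give_feedback(x, y, passages)  # self-feedback f_1, f_2, ...
    # Incorporate feedback in order, capped at a pre-defined number of steps.
    for f in feedback[:max_iters]:
        y = refine(x, y, f, passages)
    return y, passages
```

In practice the three callables would wrap calls to an LM; passing them in as parameters keeps the control flow of the loop separate from any particular model or prompt format.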