SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Kehua Feng; Xinyi Shen; Weijie Wang; Xiang Zhuang; Yuqi Tang; Qiang Zhang; Keyan Ding

TL;DR

SciKnowEval presents a five-level framework for evaluating the scientific knowledge of LLMs across biology, chemistry, physics, and materials science. It builds a large-scale, multi-source dataset of 28K questions through literature mining, refactoring of existing QA pairs, and database-to-text transformation, followed by quality control that combines LLM screening with human evaluation. The benchmark shows that proprietary and large open-source models achieve the best performance, yet substantial gaps remain in reasoning and real-world application, especially at the higher levels and on safety tasks. Incremental domain-focused pretraining and large reasoning models show promise for improving scientific capability and safety, which positions SciKnowEval as a candidate standard for evaluating and guiding future scientific LLM development.

Abstract

Large language models (LLMs) are playing an increasingly important role in scientific research, yet there remains a lack of comprehensive benchmarks to evaluate the breadth and depth of scientific knowledge embedded in these models. To address this gap, we introduce SciKnowEval, a large-scale dataset designed to systematically assess LLMs across five progressive levels of scientific understanding: memory, comprehension, reasoning, discernment, and application. SciKnowEval comprises 28K multi-level questions and solutions spanning biology, chemistry, physics, and materials science. Using this benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results show that while proprietary models often achieve state-of-the-art performance, substantial challenges remain -- particularly in scientific reasoning and real-world application. We envision SciKnowEval as a standard benchmark for evaluating scientific capabilities in LLMs and as a catalyst for advancing more capable and reliable scientific language models.

Paper Structure (55 sections, 2 figures, 12 tables)

Introduction
Methods
Design Philosophy
Data Collection Methods
I. Generating New QAs from Literature Corpus
II. Refactoring the Existing QAs
III. Transforming the Scientific Databases
Data Quality Control
Initial screening by LLMs
Human evaluation
Post-screening by LLMs
Overview of the SciKnowEval Dataset
Experiments
Experimental Setup
Evaluation Models.
...and 40 more sections

Figures (2)

Figure 1: Illustration of SciKnowEval. (a) Scientific Domains: The dataset contains four subsets: biology, chemistry, materials science, and physics. (b) Data Sources: The data are collected from articles, textbooks, and other sources. (c) Question Types: The dataset has four types of questions: relation extraction, multiple choice, content generation, and true/false. (d) Five Progressive Levels and Corresponding Examples: LLMs are evaluated at five ability levels: memory, comprehension, reasoning, discernment, and application. (e) Question Distribution: The distribution of questions across domains and ability levels.
Figure 2: An illustration of data collection approaches in SciKnowEval, including I) generating new QAs from the literature corpus, II) refactoring the existing QAs, and III) transforming the conventional scientific databases into QAs.
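
The five ability levels and four domains in Figure 1 suggest reporting scores per level rather than as a single number. The following is a minimal sketch of such a breakdown, not the benchmark's own evaluation code. The record fields (`domain`, `level`, `answer`, `prediction`), the function name `accuracy_by_level`, and the exact-match scoring rule are all assumptions made for illustration. Exact match would only suit multiple-choice and true/false items; generation and relation-extraction questions would need other metrics.

```python
from collections import defaultdict

# Level names come from the abstract; every other name here is hypothetical.
LEVELS = ["memory", "comprehension", "reasoning", "discernment", "application"]


def accuracy_by_level(records):
    """Return {level: fraction correct} for each level with at least one record.

    Assumes exact-match scoring, which is an illustrative simplification.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        level = rec["level"]
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        total[level] += 1
        correct[level] += int(rec["prediction"] == rec["answer"])
    # Keep the canonical level order and skip levels with no questions.
    return {lvl: correct[lvl] / total[lvl] for lvl in LEVELS if total[lvl]}


# Made-up records, only to show the expected input shape.
sample = [
    {"domain": "chemistry", "level": "memory", "answer": "A", "prediction": "A"},
    {"domain": "chemistry", "level": "memory", "answer": "B", "prediction": "C"},
    {"domain": "physics", "level": "reasoning", "answer": "True", "prediction": "True"},
]
scores = accuracy_by_level(sample)
print(scores)
```

Grouping by `domain` instead of `level`, or by the pair of both, would give the kind of cross-tabulation that panel (e) describes.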
