
SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models

Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, Keyan Ding

TL;DR

SciKnowEval presents a five-level framework for evaluating scientific knowledge in LLMs across biology, chemistry, physics, and materials science. It builds a large-scale, multi-source 70K-question dataset via literature mining, QA refactoring, and database-to-text transformations, coupled with rigorous quality control. The benchmark reveals that proprietary and large open-source models achieve top performance, yet substantial gaps remain in reasoning and real-world application, especially at the higher levels and on safety tasks. Incremental domain-focused pretraining and large reasoning models show promise for improving scientific capabilities and safety, positioning SciKnowEval as a potential standard for evaluating and guiding future scientific LLM development.

Abstract

Large language models (LLMs) are playing an increasingly important role in scientific research, yet there remains a lack of comprehensive benchmarks to evaluate the breadth and depth of scientific knowledge embedded in these models. To address this gap, we introduce SciKnowEval, a large-scale dataset designed to systematically assess LLMs across five progressive levels of scientific understanding: memory, comprehension, reasoning, discernment, and application. SciKnowEval comprises 28K multi-level questions and solutions spanning biology, chemistry, physics, and materials science. Using this benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results show that while proprietary models often achieve state-of-the-art performance, substantial challenges remain -- particularly in scientific reasoning and real-world application. We envision SciKnowEval as a standard benchmark for evaluating scientific capabilities in LLMs and as a catalyst for advancing more capable and reliable scientific language models.

Paper Structure (55 sections, 2 figures, 12 tables)

Figures (2)

  • Figure 1: Illustration of SciKnowEval. (a) Scientific Domains: Our dataset contains four subsets: biology, chemistry, materials science, and physics. (b) Data Sources: We collect our data from articles, textbooks, and other sources. (c) Question Types: Our dataset has four types of questions: relation-extraction, multiple-choice, content-generation, and true/false questions. (d) Five Progressive Levels and Corresponding Examples: We evaluate LLMs at five ability levels: knowledge memory, comprehension, reasoning, discernment, and application. (e) Question Distribution: The distribution of questions across domains and ability levels.
  • Figure 2: An illustration of data collection approaches in SciKnowEval, including I) generating new QAs from the literature corpus, II) refactoring the existing QAs, and III) transforming the conventional scientific databases into QAs.