SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
Kehua Feng, Xinyi Shen, Weijie Wang, Xiang Zhuang, Yuqi Tang, Qiang Zhang, Keyan Ding
TL;DR
SciKnowEval presents a five-level framework for evaluating scientific knowledge in LLMs across biology, chemistry, physics, and materials science. It builds a large-scale, multi-source question dataset via literature mining, QA refactoring, and database-to-text transformations, coupled with rigorous quality control. The benchmark shows that proprietary and large open-source models achieve top performance, yet substantial gaps remain in reasoning and real-world application, especially at the higher levels and in safety tasks. Incremental domain-focused pretraining and large reasoning models show promise for improving scientific capabilities and safety, positioning SciKnowEval as a potential standard for evaluating and guiding future scientific LLM development.
Abstract
Large language models (LLMs) are playing an increasingly important role in scientific research, yet there remains a lack of comprehensive benchmarks to evaluate the breadth and depth of scientific knowledge embedded in these models. To address this gap, we introduce SciKnowEval, a large-scale dataset designed to systematically assess LLMs across five progressive levels of scientific understanding: memory, comprehension, reasoning, discernment, and application. SciKnowEval comprises 28K multi-level questions and solutions spanning biology, chemistry, physics, and materials science. Using this benchmark, we evaluate 20 leading open-source and proprietary LLMs. The results show that while proprietary models often achieve state-of-the-art performance, substantial challenges remain, particularly in scientific reasoning and real-world application. We envision SciKnowEval as a standard benchmark for evaluating scientific capabilities in LLMs and as a catalyst for advancing more capable and reliable scientific language models.
