The Ever-Evolving Science Exam
Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai
TL;DR
The paper addresses the need for reliable, scalable evaluation of foundation models' scientific understanding while mitigating data leakage and evaluation overhead. It introduces EESE, a two-level benchmark consisting of a large EESE-Pool (>100K items across 5 disciplines and 500+ subfields) and a periodically refreshed 500-instance EESE evaluation set, designed to be leakage-resistant and cost-efficient. The methodology combines a three-stage Data Engine (Transcription, Expansion, Categorization) with a Parallel Three-Branch Refinement (Distraction, Cross-Disciplinary, Expert) to ensure Range, Reach, and Rigor. Experimental results across 32 models show clear discipline-specific strengths and weaknesses, confirm the value of reasoning-enabled approaches, and highlight cost-performance trade-offs, establishing EESE as a practical, forward-looking benchmark blueprint for robust science evaluation in large language models.
Abstract
As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor; and 2) EESE, a periodically updated 500-instance subset that is sampled from the pool and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models across scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
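The two-level design (a large private pool plus a small, periodically refreshed evaluation subset) can be illustrated with a minimal sketch. This is not the authors' sampling procedure, which the abstract does not specify; it assumes proportional stratification by discipline and a refresh keyed to a period string. The function name `sample_evaluation_set`, the record fields `id` and `discipline`, and the labels `D1` to `D5` are hypothetical placeholders.

```python
import random
from collections import defaultdict


def sample_evaluation_set(pool, size=500, period="2025-07", key="discipline"):
    """Draw a stratified subset of `pool` for one refresh period.

    Assumption: strata are proportional to each discipline's share of the pool.
    """
    if size > len(pool):
        raise ValueError("subset size exceeds pool size")

    # Seeding with the period string makes each refresh reproducible,
    # while a new period string yields a different draw.
    rng = random.Random(period)

    # Group pool items by discipline.
    by_group = defaultdict(list)
    for item in pool:
        by_group[item[key]].append(item)
    groups = sorted(by_group)

    # Proportional allocation, rounded down.
    total = len(pool)
    quota = {g: size * len(by_group[g]) // total for g in groups}

    # Hand the rounding leftovers to the largest groups, one item each.
    leftover = size - sum(quota.values())
    largest_first = sorted(groups, key=lambda g: len(by_group[g]), reverse=True)
    for g in largest_first[:leftover]:
        quota[g] += 1

    # Sample without replacement inside each group.
    subset = []
    for g in groups:
        subset.extend(rng.sample(by_group[g], quota[g]))
    return subset


# Toy pool: 1,000 placeholder items spread evenly over 5 placeholder disciplines.
pool = [{"id": i, "discipline": f"D{i % 5 + 1}"} for i in range(1000)]
july = sample_evaluation_set(pool, size=500, period="2025-07")
august = sample_evaluation_set(pool, size=500, period="2025-08")
```

Under these assumptions, only the current subset is ever released, so most of the pool remains unseen by models at any given refresh, and each evaluation run costs 500 queries rather than one per pool item.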
