The Ever-Evolving Science Exam

Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai

TL;DR

The paper addresses the need for reliable, scalable evaluation of foundation models' scientific understanding while mitigating data leakage and evaluation overhead. It introduces EESE, a two-level benchmark consisting of a large EESE-Pool (>100K items across 5 disciplines and 500+ subfields) and a periodically refreshed 500-instance EESE evaluation set, designed to be leakage-resistant and cost-efficient. The methodology combines a three-stage Data Engine (Transcription, Expansion, Categorization) with a Parallel Three-Branch Refinement (Distraction, Cross-Disciplinary, Expert) to ensure Range, Reach, and Rigor. Experimental results across 32 models show clear discipline-specific strengths and weaknesses, confirm the value of reasoning-enabled approaches, and highlight cost-performance trade-offs, establishing EESE as a practical, forward-looking benchmark blueprint for robust science evaluation in large language models.

Abstract

As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor; and 2) a periodically updated 500-instance subset, EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
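
The two-level design described above (a large private pool plus a small, periodically refreshed evaluation subset) can be illustrated with a short sketch. This is not the authors' sampling procedure, which the abstract does not specify. It assumes proportional stratified sampling by discipline, a hypothetical `discipline` field on each instance, and a seed that changes at each refresh. The function name `sample_eval_set` and the toy pool are illustrative only. The discipline labels are borrowed from the Figure 4 caption.

```python
import random
from collections import defaultdict


def sample_eval_set(pool, size=500, seed=0):
    """Draw a discipline-stratified subset of `size` items from `pool`.

    Assumption: each item is a dict with a hypothetical "discipline" key.
    Quotas are proportional to each discipline's share of the pool.
    """
    if size > len(pool):
        raise ValueError("subset size exceeds pool size")

    rng = random.Random(seed)  # a new seed per refresh yields a new subset

    by_discipline = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)

    # Proportional allocation with the largest-remainder method,
    # so the quotas always sum to exactly `size`.
    total = len(pool)
    exact = {d: size * len(items) / total for d, items in by_discipline.items()}
    quota = {d: int(x) for d, x in exact.items()}
    leftover = size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda d: exact[d] - quota[d], reverse=True)
    for d in by_remainder[:leftover]:
        quota[d] += 1

    subset = []
    for d in sorted(by_discipline):
        subset.extend(rng.sample(by_discipline[d], quota[d]))
    return subset


# Toy pool: the counts below are made up for illustration.
toy_counts = {"ETS": 400, "NS": 300, "AS": 100, "SSH": 150, "MS": 50}
toy_pool = [
    {"id": f"{d}-{i}", "discipline": d}
    for d, n in toy_counts.items()
    for i in range(n)
]
eval_set = sample_eval_set(toy_pool, size=500, seed=2025)
```

Fixing the seed makes a given release reproducible, while changing it at each refresh draws a different subset from the same private pool. The validation step mentioned in the abstract is not modeled here.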

Paper Structure

This paper contains 13 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of EESE-Pool construction, which adheres to the principles of Range (vast quantity of instances), Reach (diverse fields and question formats), and Rigor (systematic and rigorous data construction). Specifically, EESE-Pool comprises over 100K science question–answer pairs spanning 5 disciplines and over 500 subfields.
  • Figure 2: EESE-Pool Construction Framework. The three-stage Data Engine (Transcription, Expansion, Categorization) with a systematic Data Refinement process ensures large-scale coverage, expert-enriched content, difficulty stratification, and iterative quality improvement, laying a foundation for dynamic, leakage-resilient EESE.
  • Figure 3: Data refinement of EESE-Pool. Candidate instances are systematically improved through three refinement paths: Enhancement by Distraction, Enrichment by Cross-Disciplinary, and Expert-Driven Refinement. This multi-level human involvement strategy effectively raises instance difficulty, ensuring robust and discriminative evaluation.
  • Figure 4: Performance of six leading models evaluated on the EESE-Pool, leveraging over 100K expertly verified instances and comprising more than 600K model inferences (evaluated across 50 representative fields). Each subplot corresponds to one field, identified by its label (such as 'D1-D12', see appendix), and is color-coded by its parent discipline: ETS (blue), NS (purple), AS (orange), SSH (green), and MS (red). Bars from left to right in each subplot represent the average performance of O3, Gemini-2.5-Pro, GPT-4o, DeepSeek-R1, Qwen-2.5-72B-Instruct, and Grok-3.
  • Figure 5: Quick comparison of human performance and top-performing models with thinking on EESE. Each bar group corresponds to one discipline and shows, from left to right, the scores of Human, O3, Gemini-2.5-Pro, and Grok-4.
  • ...and 1 more figure