GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?

Zhikai Lei, Tianyi Liang, Hanglei Hu, Jin Zhang, Yunhua Zhou, Yunfan Shao, Linyang Li, Chenchui Li, Changbo Wang, Hang Yan, Qipeng Guo

TL;DR

This paper questions whether high scores on standard LLM benchmarks truly reflect human-like capabilities. It introduces GAOKAO-Eval, a comprehensive Gaokao-based benchmark with strict closed-book evaluation, leakage-free data, and expert grading to better approximate human testing. Using Rasch modeling, it uncovers scoring that is largely invariant to question difficulty and high variance in LLM responses, along with grading inconsistencies, indicating a mismatch between scores and true capabilities. The authors demonstrate that using reasoning tokens as a proxy for task difficulty can mitigate the mismatch, highlighting the need for LLM-aligned difficulty analysis in benchmark design.

Abstract

Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may "game" these benchmarks due to data leakage, achieving high scores while struggling with tasks that are simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct "closed-book" evaluations for representative models released prior to the Gaokao. Contrary to prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across various question difficulties, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers, as well as recurring mistake patterns. We find that these phenomena are well-grounded in the motivations behind OpenAI o1, and that o1's reasoning-as-difficulties can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.
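
For readers unfamiliar with the Rasch model mentioned above, the following is a minimal sketch of its standard dichotomous form, in which the probability of a correct answer is a logistic function of ability minus difficulty. It is not the authors' implementation: the function names, the difficulty values, and the flat "observed" scores are all hypothetical, chosen only to illustrate what a difficulty-invariant score profile looks like next to the Rasch prediction.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Standard dichotomous Rasch model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def mean_abs_residual(observed, predicted):
    """Average absolute gap between observed scores and Rasch predictions."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical question difficulties, ordered from easy to hard (logit scale).
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Rasch prediction for a test taker of average ability (ability = 0).
predicted = [rasch_probability(0.0, b) for b in difficulties]

# Hypothetical flat score profile: the same score rate at every difficulty.
flat_observed = [0.7] * len(difficulties)

print("predicted:", [round(p, 3) for p in predicted])
print("residual of flat profile:", round(mean_abs_residual(flat_observed, predicted), 3))
```

Under the Rasch model the predicted score falls steadily as difficulty rises, so a profile that stays flat across difficulties leaves a large residual. This is the kind of departure the abstract describes as anomalously consistent performance.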

Paper Structure

This paper contains 25 sections, 4 equations, 13 figures, 22 tables.

Figures (13)

  • Figure 1: Comparison of LLM scores on the first and last questions of a Gaokao paper. Although the last question is more difficult, LLMs achieve similar scores on both, revealing potential inconsistencies.
  • Figure 2: The GAOKAO-Eval pipeline. Built on the Gaokao benchmark, which ensures balanced difficulty and subject coverage, GAOKAO-Eval evaluates models released before the exam date under strict closed-book conditions, with human teachers grading subjective responses. Findings show that, even with high scores, LLMs have inconsistent scoring patterns and greater variation on tasks of similar difficulty. In contrast, human performance changes more predictably with task difficulty.
  • Figure 3: Comprehensiveness of GAOKAO-Eval.
  • Figure 4: Total performance of LLMs on the New Curriculum Standard Paper and the National Type A Paper. $+VL$: for questions involving images, the corresponding multimodal version of the model is used for inference.
  • Figure 5: Consistency distribution of Elo ratings across different models and methods, demonstrating alignment with human expert difficulty ratings.
  • ...and 8 more figures