EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education
Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Jiawei Shen, Zilong Li, Yidan Liang
TL;DR
EduEval addresses the need for rigorous evaluation of LLMs in Chinese education by introducing a hierarchical benchmark grounded in the EduAbility Taxonomy (Bloom's Taxonomy, Webb's Depth of Knowledge, and Ethics). It combines authentic data sources across primary to high school levels and 24 task types, enabling cross-cutting assessment of memory, understanding, application, reasoning, creativity, and ethics. Evaluating 14 models under zero- and few-shot settings reveals strong recall but weak performance on authentic classroom tasks and complex reasoning, with open-source models sometimes exceeding proprietary systems on education-specific reasoning. The results inform targeted curriculum-based pre-training and architecture improvements, and the authors propose expanding EduEval with multimodal tasks and finer metrics. Overall, EduEval offers a scalable, authentic, and theory-grounded framework to guide the development of LLMs that better support Chinese K-12 education.
Abstract
Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions: Memorization, Understanding, Application, Reasoning, Creativity, and Ethics; (2) Authenticity: Our benchmark integrates real exam questions, classroom conversations, student essays, and expert-designed prompts to reflect genuine educational challenges; (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open-source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.
