OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education

Min Zhang, Hao Chen, Hao Chen, Wenqi Zhang, Didi Zhu, Xin Lin, Bo Jiang, Aimin Zhou, Fei Wu, Kun Kuang

TL;DR

OmniEduBench tackles the gap in evaluating LLMs for education by introducing a native Chinese benchmark with two dimensions: knowledge and cultivation. It aggregates 24.602K QA pairs across 61 subjects and 11 question types, plus a HARD subset to stress-test reasoning and pedagogy. Through a rigorous construction pipeline and dual-metric evaluation, the study reveals substantial gaps among state-of-the-art models, with Gemini-2.5 Pro leading in knowledge and QWQ in cultivation, while HARD samples pose significant challenges. The work provides a principled framework for holistic educational assessment and points to future directions in richer cultivation tasks and multimodal scenarios.

Abstract

With the rapid development of large language models (LLMs), LLM-based systems have been widely applied in education. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension and largely neglect the evaluation of cultivation capabilities, which are essential for real-world educational scenarios. In addition, current benchmarks are often limited to a single subject or question type and lack sufficient diversity. This issue is particularly prominent in the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is divided into two core dimensions, the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided into 6 fine-grained categories, covering a total of 61 different subjects (41 in the knowledge dimension and 20 in the cultivation dimension). The dataset also features a rich variety of question formats, including 11 common exam question types, providing a solid foundation for comprehensively evaluating LLMs' capabilities in education. Extensive experiments on 11 mainstream open-source and closed-source LLMs reveal a clear performance gap. In the knowledge dimension, only Gemini-2.5 Pro surpassed 60% accuracy, while in the cultivation dimension, the best-performing model, QWQ, still trailed human intelligence by nearly 30%. These results highlight the substantial room for improvement and underscore the challenges of applying LLMs in education.

Paper Structure

This paper contains 17 sections, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Overview of OmniEduBench. The benchmark comprises two dimensions: 41 subjects across six categories in the knowledge dimension, and 20 subjects across six categories in the cultivation dimension.
  • Figure 2: Overview of the construction process, including collection, cleaning, filtering, and verification.
  • Figure 3: Examples of (a) a single-choice question in the knowledge dimension, from college chemistry, and (b) a single-choice question in the cultivation dimension. English translations are shown for better readability.
  • Figure 4: Examples of (a) a multiple-choice question in the knowledge dimension, from Biology, and (b) a short-answer question in the knowledge dimension, from Math. English translations are shown for better readability.
  • Figure 5: Zero-shot average accuracy (%) on the knowledge dimension of OmniEduBench HARD.
  • ...and 3 more figures