E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

Jinchang Hou; Chang Ao; Haihong Wu; Xiangtao Kong; Zhigang Zheng; Daijia Tang; Chengming Li; Xiping Hu; Ruifeng Xu; Shiwen Ni; Min Yang

TL;DR

E-EVAL is the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field. It aims to analyze the strengths and limitations of LLMs in educational applications and to contribute to the progress of Chinese K-12 education and LLMs.

Abstract

With the accelerating development of Large Language Models (LLMs), many LLMs are beginning to be used in the Chinese K-12 education domain. Although LLMs and education are becoming increasingly integrated, there is currently no LLM evaluation benchmark that focuses on the Chinese K-12 education domain. A comprehensive natural language processing benchmark is therefore urgently needed to accurately assess the capabilities of various LLMs in this domain. To address this, we introduce E-EVAL, the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field. E-EVAL consists of 4,351 multiple-choice questions at the primary, middle, and high school levels across a wide range of subjects, including Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, and Geography. We conducted a comprehensive evaluation of advanced LLMs on E-EVAL, including both English-dominant and Chinese-dominant models. Findings show that Chinese-dominant models perform well compared to English-dominant models, with many even scoring above GPT 4.0. However, almost all models perform poorly in complex subjects such as mathematics. We also found that most Chinese-dominant LLMs did not achieve higher scores at the primary school level than at the middle school level. We observe that a model's mastery of higher-order knowledge does not necessarily imply mastery of lower-order knowledge. Additionally, the experimental results indicate that the Chain of Thought (CoT) technique is effective only for the challenging science subjects, while few-shot prompting is more beneficial for liberal arts subjects. With E-EVAL, we aim to analyze the strengths and limitations of LLMs in educational applications and to contribute to the progress of Chinese K-12 education and LLMs.

Paper Structure (19 sections, 8 figures, 11 tables)

Introduction
The E-EVAL Evaluation Benchmark
Design Principle
Data Collection
E-EVAL Arts and Science
Evaluation
Experiment
Setup
Prompt
Models
Main Results
Insight and Analysis
Related Work
Discussion and Conclusion
Detailed Stats of E-EVAL
...and 4 more sections

Figures (8)

Figure 1: Overview diagram of the E-EVAL benchmark.
Figure 2: A development example with explanations from E-EVAL. English translations are provided beneath the relevant Chinese text.
Figure 3: An example with five-shot-ao from E-EVAL. The red part is the model's response. English translations are provided beneath the relevant Chinese text.
Figure 4: An example with five-shot-cot from E-EVAL. The red part is the model's response. English translations are provided beneath the relevant Chinese text.
Figure 5: A simple primary school math problem with the predictions of the top-3 models.
...and 3 more figures
