EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Zhuangzhi Dong, Jingren Zhang, Yufan Deng, Xinyu Zou, Yang Gao, Heyan Huang

TL;DR

EduBench introduces a comprehensive, batch-generated benchmark for evaluating LLMs in diverse educational scenarios, covering 9 educational scenarios and over 4,000 contexts with 18,821 data points. It proposes a 12-dimension, pedagogy-focused evaluation framework (three core dimensions with four sub-metrics each) and calibrates LLM-based evaluators against human judgments using a 198-sample test set. The paper demonstrates that smaller models trained on EduBench data through multi-source distillation can rival state-of-the-art large models, and it provides extensive analysis of evaluator consistency and model behavior. The benchmark aims to drive robust, scenario-aware educational AI development and practical deployment in teaching and learning contexts.

Abstract

As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data that spans 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.

Paper Structure

This paper contains 99 sections, 3 figures, and 18 tables.

Figures (3)

  • Figure 1: The left section presents our 9 educational scenarios, along with their multi-dimensional educational contexts and corresponding metrics. The right section illustrates the results from human evaluation on EduBench.
  • Figure 2: Overview of EduBench, with data curation on the left, evaluation principles and human–LLM alignment in the middle, and downstream performance gains for smaller models on the right.
  • Figure 3: Workflow of Selecting the Most Human-Aligned LLM and Conducting Full-Scale Evaluation.