Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models
Yuyan Chen, Chenwei Wu, Songzhou Yan, Panjun Liu, Haoyu Zhou, Yanghua Xiao
TL;DR
This paper frames large language models as educators and introduces Dr.Academy, a benchmark that assesses their ability to generate high-quality educational questions using Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. It constructs large-scale, taxonomy-aligned contexts (from SQuAD and MMLU) and applies three task setups to evaluate question generation with four metrics (consistency, relevance, coverage, representativeness), validated by expert judgments and aligned with human scoring. The experiments across 11 LLMs reveal GPT-4 as the strongest all-around teacher in general and monodisciplinary domains, with Claude2 excelling in interdisciplinarity; automatic GPT-4-based scoring shows high correlation with human judgments, supporting the viability of automatic evaluation for teaching-style capabilities. The work provides a foundational framework for evaluating LLM teaching abilities, highlights domain-specific strengths, and points to future work in refining metrics and expanding domain coverage to more fully capture teaching effectiveness beyond question generation.
Abstract
Teachers play an important role in imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark that evaluates the questioning capability of LLMs acting as teachers by assessing the educational questions they generate, utilizing Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability by guiding them to generate questions. We apply four metrics (relevance, coverage, representativeness, and consistency) to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses, while Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.
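The claim that automatic scores align with human judgments is typically checked with a correlation statistic. The abstract does not say which one is used, so the following is only a minimal sketch assuming a Pearson correlation over paired per-question scores. The function name `pearson` and the score lists are hypothetical and illustrative; they are not data from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five generated questions
# (made-up values for illustration, not results from the paper).
auto_scores = [4.0, 3.5, 5.0, 2.0, 4.5]   # assigned by an automatic grader
human_scores = [4.0, 3.0, 5.0, 2.5, 4.0]  # assigned by human experts

print(f"Pearson r = {pearson(auto_scores, human_scores):.3f}")
```

A value near 1 would suggest that the automatic grader ranks questions much as the human experts do. A rank-based statistic such as Spearman's correlation is a reasonable alternative when only the ordering of scores matters.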
