EduEVAL-DB: A Role-Based Dataset for Pedagogical Risk Evaluation in Educational Explanations
Javier Irigoyen, Roberto Daza, Aythami Morales, Julian Fierrez, Francisco Jurado, Alvaro Ortigosa, Ruben Tolosana
TL;DR
EduEVAL-DB addresses the need for safe, pedagogy-focused evaluation of explanations in K-12 contexts by introducing a risk-based dataset built from ScienceQA questions. It comprises 854 explanations for 139 questions, written by six LLM-simulated teacher roles and a human teacher, and annotated with a five-dimension pedagogical risk rubric. The work defines a structured evaluation protocol, and its preliminary experiments indicate that a lightweight model can be fine-tuned to detect pedagogical risks, while large frontier models remain strong at factual assessment. The dataset and protocol are intended to support safer, more education-aligned AI tutors and evaluators, with potential for deployment on consumer hardware and future multimodal extensions.
Abstract
This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.
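To make the annotation scheme concrete, the following is a minimal sketch of how one annotated explanation could be represented, with one binary label per risk dimension. It is an illustration only: the class name, field names, source labels, and the convention that `True` means "risk present" are assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass, field

# The five risk dimensions named in the abstract.
# The identifier spellings are illustrative, not the dataset's column names.
RISK_DIMENSIONS = (
    "factual_correctness",
    "explanatory_depth_and_completeness",
    "focus_and_relevance",
    "student_level_appropriateness",
    "ideological_bias",
)


def _no_risks():
    """Default annotation: no risk flagged on any dimension."""
    return {dim: False for dim in RISK_DIMENSIONS}


@dataclass
class ExplanationRecord:
    """Hypothetical record for one explanation and its binary risk labels."""
    question_id: str
    source: str        # e.g. "human_teacher" or a placeholder role label
    explanation: str
    risks: dict = field(default_factory=_no_risks)  # True = risk present (assumed)

    def __post_init__(self):
        # Require exactly one boolean label per rubric dimension.
        if set(self.risks) != set(RISK_DIMENSIONS):
            raise ValueError("risks must contain exactly the five rubric dimensions")
        if not all(isinstance(v, bool) for v in self.risks.values()):
            raise ValueError("risk labels must be binary (bool)")


def flagged_dimensions(record):
    """Return the rubric dimensions flagged as risky, in rubric order."""
    return [dim for dim in RISK_DIMENSIONS if record.risks[dim]]


def risk_rates(records):
    """Fraction of records flagged on each dimension."""
    records = list(records)
    if not records:
        return {dim: 0.0 for dim in RISK_DIMENSIONS}
    return {
        dim: sum(r.risks[dim] for r in records) / len(records)
        for dim in RISK_DIMENSIONS
    }
```

A per-dimension summary such as `risk_rates` is one simple way to compare explanation sources, for example the human teacher against each simulated role, before training or benchmarking an evaluator.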
