EduEVAL-DB: A Role-Based Dataset for Pedagogical Risk Evaluation in Educational Explanations

Javier Irigoyen, Roberto Daza, Aythami Morales, Julian Fierrez, Francisco Jurado, Alvaro Ortigosa, Ruben Tolosana

TL;DR

EduEVAL-DB addresses the need for safe, pedagogy-focused evaluation of explanations in K–12 contexts by introducing a risk-based dataset built from ScienceQA questions. It contains 854 explanations for 139 questions, written by six LLM-simulated teacher roles and a human teacher and annotated with binary labels under a five-dimension pedagogical risk rubric. The work defines a structured evaluation protocol and reports preliminary experiments suggesting that a lightweight model (Llama 3.1 8B) can be fine-tuned to detect pedagogical risks, while large frontier models remain strong at factual assessment. The dataset and protocol are intended to support safer, more education-aligned AI tutors and evaluators, with potential for deployment on consumer hardware and future multimodal extensions.

Abstract

This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.
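As a rough illustration of the annotation scheme described above, the following minimal sketch models one explanation with binary labels for the five risk dimensions. The class name, field names, and dimension identifiers are assumptions made for this example; the released dataset may use a different schema.

```python
from dataclasses import dataclass, field

# The five rubric dimensions named in the abstract.
# The identifier spellings are assumptions, not the dataset's actual keys.
RISK_DIMENSIONS = (
    "factual_correctness",
    "depth_and_completeness",
    "focus_and_relevance",
    "student_level_appropriateness",
    "ideological_bias",
)


@dataclass
class Explanation:
    """Hypothetical record for one annotated explanation."""
    question_id: str
    role: str   # e.g. "human_teacher" or one of the simulated roles
    text: str
    risks: dict = field(default_factory=dict)  # dimension -> 0 (no risk) or 1 (risk)

    def __post_init__(self):
        # Reject dimensions outside the rubric and non-binary labels.
        unknown = set(self.risks) - set(RISK_DIMENSIONS)
        if unknown:
            raise ValueError(f"unknown risk dimensions: {sorted(unknown)}")
        for dim, label in self.risks.items():
            if label not in (0, 1):
                raise ValueError(f"label for {dim} must be 0 or 1, got {label!r}")

    def flagged(self):
        """Return the dimensions labeled as risky, in rubric order."""
        return [d for d in RISK_DIMENSIONS if self.risks.get(d, 0) == 1]


example = Explanation(
    question_id="q-001",  # made-up identifier
    role="human_teacher",
    text="Plants make their own food using sunlight.",
    risks={"factual_correctness": 0, "depth_and_completeness": 1},
)
print(example.flagged())
```

Dimensions missing from `risks` are treated as unlabeled and are not reported by `flagged()`; that is a design choice of this sketch, not something the source specifies.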

Paper Structure

This paper contains 11 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Overview of the framework used in this work for the construction of EduEVAL-DB and pedagogical risk evaluation. The diagram illustrates the main stages of the proposed pipeline: i) the definition of a pedagogical risk rubric; ii) the construction and annotation of EduEVAL-DB, where each K–12 question is paired with multiple explanations generated by LLM-simulated teacher roles and a human teacher, and annotated with binary risk labels; and iii) the use of EduEVAL-DB to benchmark and fine-tune LLM-based pedagogical evaluators.
  • Figure 2: Example explanations generated by the six teacher-inspired roles. Each role generates an instructional explanation for the same question, conditioned on the student’s grade level and the multiple-choice context. To ensure responsible presentation, the Sarcastic Teacher output is blurred in the figure, although the complete explanation is included in the released dataset.
  • Figure 3: Confusion matrices for the Llama 3.1 8B pedagogical evaluator on EduEVAL-DB in zero-shot baseline (top) and fine-tuned (bottom) settings across the five pedagogical risk dimensions defined in the proposed rubric. Each matrix reports the distribution of predicted versus ground-truth binary labels, where 0 denotes the absence of pedagogical risk and 1 denotes its presence (detection). Cell color intensity reflects the proportion of samples in each category, with warmer colors indicating higher frequencies.
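
The per-dimension confusion matrices described in Figure 3 can be computed along the following lines. This is a minimal sketch on made-up toy labels, not the authors' evaluation code; the function names and dimension keys are assumptions for illustration.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 = no risk, 1 = risk)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1  # row = ground truth, column = prediction
    return matrix


def proportions(matrix):
    """Convert counts to fractions of all samples (the quantity a heatmap would color)."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return [[0.0, 0.0], [0.0, 0.0]]
    return [[cell / total for cell in row] for row in matrix]


def per_dimension(gold, pred):
    """Build one confusion matrix per risk dimension.

    gold and pred map a dimension name to a list of 0/1 labels.
    """
    return {dim: confusion_matrix(gold[dim], pred[dim]) for dim in gold}


# Toy labels for two dimensions; these values are invented for the example.
gold = {"factual_correctness": [0, 1, 1, 0], "ideological_bias": [0, 0, 1, 0]}
pred = {"factual_correctness": [0, 1, 0, 1], "ideological_bias": [0, 0, 1, 0]}

for dim, matrix in per_dimension(gold, pred).items():
    print(dim, matrix, proportions(matrix))
```

Running the same routine on zero-shot and fine-tuned predictions would give the two rows of matrices that the figure compares.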