EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Yilin Jiang, Mingzi Zhang, Xuanyu Yin, Sheng Jin, Suyu Lu, Zuocan Ying, Zengyi Yu, Xiangjie Kong

TL;DR

EduGuardBench is a holistic benchmark that evaluates Teacher SP-LLMs (LLMs that simulate the teaching profession) on both pedagogical fidelity and domain-specific safety. Its first component pairs a Teaching Harm assessment built on select-all-that-apply (SATA) questions with a Role-playing Fidelity Score ($RFS$) and an Ethical Flaw analysis. Its second, adversarial-safety component measures Attack Success Rate ($ASR$) and Refusal Quality. All of this runs under a human-in-the-loop (HITL) evaluation. Across 14 models, reasoning-oriented models generally show higher pedagogical fidelity, but safety vulnerabilities persist, and a scaling paradox emerges in which mid-sized models can be the most vulnerable. A key finding is the Educational Transformation Effect: the safest models convert harmful requests into teachable moments, and this capacity is strongly negatively correlated with $ASR$, which suggests new directions for safety training and deployment in educational AI.

Abstract

Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge, as existing benchmarks fail to measure role-playing fidelity or address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox, where mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals. This capacity is strongly negatively correlated with ASR, revealing a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. See https://github.com/YL1N/EduGuardBench for Materials.
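To make the relationship between the adversarial metrics concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each model response to an adversarial prompt has already been labeled by an evaluator. The label names, the helper functions, and the toy data are all hypothetical placeholders and do not reproduce the paper's three-tier rubric or its results. ASR is computed as the share of prompts where the attack succeeded, and the Educational Refusal rate as the share of refusals that were educational.

```python
import math

# Hypothetical response labels (placeholders, not the paper's rubric names).
COMPLIED = "complied"                # the attack succeeded
REFUSED = "refused"                  # a refusal that is not educational
EDU_REFUSAL = "educational_refusal"  # refusal turned into a teachable moment


def attack_success_rate(labels):
    """Fraction of adversarial prompts on which the attack succeeded."""
    if not labels:
        raise ValueError("need at least one labeled response")
    return labels.count(COMPLIED) / len(labels)


def educational_refusal_rate(labels):
    """Fraction of refusals that were Educational Refusals."""
    refusals = [label for label in labels if label != COMPLIED]
    if not refusals:
        return 0.0
    return refusals.count(EDU_REFUSAL) / len(refusals)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Toy, made-up labels for three hypothetical models (10 prompts each).
toy_runs = {
    "model_a": [COMPLIED] * 1 + [REFUSED] * 1 + [EDU_REFUSAL] * 8,
    "model_b": [COMPLIED] * 4 + [REFUSED] * 3 + [EDU_REFUSAL] * 3,
    "model_c": [COMPLIED] * 7 + [REFUSED] * 2 + [EDU_REFUSAL] * 1,
}

asr = [attack_success_rate(labels) for labels in toy_runs.values()]
edu = [educational_refusal_rate(labels) for labels in toy_runs.values()]
r = pearson(edu, asr)
print(f"ASR per model: {asr}")
print(f"Pearson r between Educational Refusal rate and ASR: {r:.3f}")
```

On this toy data the coefficient is negative by construction. It only illustrates the direction of the relationship described in the abstract, not its measured strength.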

Paper Structure

This paper contains 53 sections, 11 figures, 11 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of the dual challenges for Teacher SP-LLMs. The left panel shows how a simple student query can elicit pedagogically harmful responses categorized as Incompetence, Indolence, or Offensiveness. The right panel demonstrates how a persona-based jailbreak prompt can bypass safety alignments to generate harmful content.
  • Figure 2: Teaching capability analysis: (Left) Error rates by scenario; (Right) Error type distribution by model reasoning capability.
  • Figure 3: Adversarial safety evaluation across different attack categories.
  • Figure 4: The generalized and anonymized meta-prompt. It uses placeholders (e.g., [{academic_stage}]) for dynamic content, and generic terms ("Language A", "Language B") to comply with double-blind review policies. It explicitly mandates the generation of five options (A-E) and directs the LLM to produce a new, bilingual SATA question in the specified JSON format.
  • Figure 5: The meta-prompt designed for the Combinatorial Expansion pipeline. It instructs the LLM on its goal (generating two sets of 13 items), provides context (the target harm dimension), and specifies the exact JSON output structure. The process is guided by few-shot examples that are dynamically injected by the control script at runtime.
  • ...and 6 more figures