SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems

Rima Hazra, Bikram Ghuku, Ilona Marchenko, Yaroslava Tokarieva, Sayan Layek, Somnath Banerjee, Julia Stoyanovich, Mykola Pechenizkiy

Abstract

Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interaction. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To study these failure modes systematically, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from learning-science literature. We find that all models show broad harm, that scale does not reliably help, and that multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn "safe/helpful" results can mask systematic tutor failure over extended interaction.

Paper Structure (44 sections, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Overview of the SafeTutors benchmark construction and evaluation pipeline.
  • Figure 2: Overview of the AI Tutor Risk Taxonomy comprising 11 risk categories and 48 sub-risks, each grounded in established learning sciences theories. Color groups: cognitive/epistemic, motivational/developmental, behavioral/ethical, reflective/relational. Detailed definitions and operationalization criteria are in Appendix \ref{appn:taxonomy} and Table \ref{appn:tabriskdef}.
  • Figure 3: Single-turn process-level pedagogical analysis across subjects and models. From left to right: Physics, Chemistry, Mathematics.
  • Figure 4: Multi-turn learning-trajectory analysis across subjects and models. From left to right: Physics, Chemistry, Mathematics.