Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

Oluseyi Olukola, Nick Rahimi

Abstract

Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety for educational RL, comprising structural, progress, behavioral, and alignment safety, and propose the Reward Hacking Severity Index (RHSI) to quantify misalignment between proxy rewards and genuine learning. We evaluate the framework in a controlled simulation of an AI tutoring environment with 120 sessions across four conditions and three learner profiles, totaling 18,000 interactions. Results show that an engagement-optimized agent systematically over-selected a high-engagement action with no direct mastery gain, producing strong measured performance but limited learning progress. A multi-objective reward formulation reduced this problem but did not eliminate it, as the agent continued to favor proxy-rewarding behavior in many states. In contrast, a constrained architecture combining prerequisite enforcement and minimum cognitive demand substantially reduced reward hacking, lowering RHSI from 0.317 in the unconstrained multi-objective condition to 0.102. Ablation results further suggest that behavioral safety was the most influential safeguard against repetitive low-value action selection. These findings suggest that reward design alone may be insufficient to ensure pedagogically aligned behavior in educational RL, at least in the simulated environment studied here. More broadly, the paper positions pedagogical safety as an important research problem at the intersection of AI safety and intelligent educational systems.
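To make the abstract's central quantity concrete, the sketch below shows one way a reward-hacking severity score of this flavor could be computed from session logs. It is an illustrative stand-in, not the paper's Definition 3.4 (whose exact formula is not reproduced on this page): the hypothetical `rhsi_estimate` simply measures the share of proxy reward earned on interactions that produced no mastery gain.

```python
import numpy as np

def rhsi_estimate(proxy_rewards, mastery_gains):
    """Illustrative reward-hacking severity score in [0, 1].

    NOTE: a hypothetical stand-in, not the paper's Definition 3.4.
    It measures the fraction of total proxy reward (e.g., engagement)
    earned on interactions with no mastery gain, capturing the
    proxy-vs-learning divergence that RHSI is meant to quantify.
    """
    proxy_rewards = np.asarray(proxy_rewards, dtype=float)
    mastery_gains = np.asarray(mastery_gains, dtype=float)
    total = proxy_rewards.sum()
    if total <= 0:
        return 0.0
    # Proxy reward collected while learning progress was flat or negative.
    hacked = proxy_rewards[mastery_gains <= 0].sum()
    return float(hacked / total)

# A session where the agent spams a high-engagement, zero-mastery action
# scores high; a session whose rewards track learning scores low.
print(rhsi_estimate([1.0, 1.0, 1.0, 0.2], [0.0, 0.0, 0.0, 0.05]))    # ~0.94
print(rhsi_estimate([0.2, 0.3, 0.4, 0.5], [0.02, 0.03, 0.04, 0.05]))  # 0.0
```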

Paper Structure

This paper contains 28 sections, 7 equations, 5 figures, and 11 tables.

Figures (5)

  • Figure 1: SmartTutor system architecture. Three layers govern the tutoring loop: the Student Interaction Layer, the SmartTutor Core (five components), and the Safety Monitoring Layer, which evaluates C2--C4 constraint violations and aggregates them into the RHSI.
  • Figure 2: Action distribution across conditions (% of 4,500 interactions per condition). MO selects Encourage at the highest rate of any condition (32.6%), exceeding even EO (25.8%). ST produces the most balanced distribution, with no action exceeding 18.4% and substantially higher usage of high-demand actions than any unconstrained condition.
  • Figure 3: Mean cumulative knowledge level (estimated per-concept mastery averaged across all 27 concepts) over 150 interactions, by condition and learner profile (mean $\pm$ SEM across 10 seeds). Note that this shows the knowledge level, not the mastery gain $\Delta K$ reported in Table \ref{tab:learning}. MAS and ST consistently achieve the highest knowledge levels across all profiles; the gap widens with learner capability. ST matches MAS on mastery trajectories despite maintaining the engagement signal in its reward function.
  • Figure 4: Per-seed RHSI distributions across all conditions ($n = 30$ per condition; 3 profiles $\times$ 10 seeds). Box shows IQR, white line is median, whiskers extend to 1.5$\times$IQR, dots are outliers. ST achieves the lowest and most consistent RHSI. Significance brackets show Bonferroni-corrected comparisons. The dashed line at $\varepsilon = 0.25$ marks the realistic deployment safety threshold discussed in Section \ref{sec:rhsi}.
  • Figure 5: Parameter sensitivity: mean RHSI across the $W \times \delta_{\min}$ grid (20 cells). Darker shading indicates higher RHSI (worse safety) for EO and MO; darker green indicates higher RHSI for MAS and ST. EO remains the worst condition in all 20 cells. The dashed border on the ST panel marks the $\delta_{\min} \geq 0.45$ region where MAS edges out ST (see text); outside this region ST achieves the lowest RHSI of any condition.

Theorems & Definitions (4)

  • Definition 3.1: Pedagogical Safety
  • Definition 3.2: $\varepsilon$-Pedagogical Safety
  • Definition 3.3: Educational Reward Hacking
  • Definition 3.4: Reward Hacking Severity Index
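
As a reading aid for the definitions above, the following minimal sketch (an assumed structure, not the authors' code) shows how per-interaction checks for three of the four layers might be monitored, mirroring Figure 1's Safety Monitoring Layer, which flags C2--C4 constraint violations; alignment safety would then be assessed at the session level by aggregating into an RHSI-style score such as the one sketched earlier. All names and thresholds here are hypothetical, and the mapping of C2--C4 to specific layers is assumed.

```python
from collections import deque

class SafetyMonitor:
    """Hypothetical per-interaction monitor; field names, thresholds,
    and the C2--C4 layer mapping are illustrative assumptions, not
    the paper's implementation."""

    def __init__(self, repeat_window=3, min_window_gain=0.01):
        self.min_window_gain = min_window_gain           # progress-safety floor
        self.recent_actions = deque(maxlen=repeat_window)

    def check(self, action, prereqs_met, window_gain):
        """Return the list of safety layers violated by this step."""
        violations = []
        if not prereqs_met:
            violations.append("structural")   # prerequisite enforcement (C2?)
        if window_gain < self.min_window_gain:
            violations.append("progress")     # stalled mastery gain (C3?)
        self.recent_actions.append(action)
        if (len(self.recent_actions) == self.recent_actions.maxlen
                and len(set(self.recent_actions)) == 1):
            violations.append("behavioral")   # repetitive low-value loop (C4?)
        return violations

# Example: an agent repeating "Encourage" with no mastery gain trips
# both the progress and behavioral checks after three steps.
monitor = SafetyMonitor()
for _ in range(3):
    flags = monitor.check("Encourage", prereqs_met=True, window_gain=0.0)
print(flags)  # ['progress', 'behavioral']
```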