Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education

Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He

TL;DR

The paper tackles the safety challenges of deploying educational LLMs by introducing EduHarm, a domain-specific benchmark for safe–unsafe instructions across five educational scenarios, and proposes a unified defense framework (TSSF) to counter both jailbreak and fine-tuning attacks. TSSF combines safety-aware attention realignment, layer-wise safety judgment, and defense-driven dual routing to restore harmfulness signals, detect unsafe inputs across layers, and adaptively route queries through safe or guarded pathways. Empirical results across multiple models and eight jailbreak strategies, plus three fine-tuning datasets, show that TSSF substantially reduces attack success rates while preserving benign task performance and keeping inference overhead within practical limits. The work provides a practical, scalable approach to educational LLM safety with strong generalization across architectures and attack types, holding significant potential for safer AI-assisted learning environments.

Abstract

Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.
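
The last two stages described in the abstract (aggregating per-layer safety cues, then routing the query) can be illustrated with a minimal sketch. This is not the authors' implementation: the mean aggregation, the 0.5 threshold, and the names `aggregate_layer_scores`, `route_query`, `normal_answer`, and `guarded_answer` are all assumptions made for illustration, and the per-layer scores are taken as given rather than computed from a model.

```python
from typing import Callable, Sequence


def aggregate_layer_scores(layer_scores: Sequence[float]) -> float:
    """Combine per-layer harmfulness scores into one value.

    Assumption: a plain mean. The abstract only says cues are
    aggregated across layers, not how.
    """
    if not layer_scores:
        raise ValueError("need at least one layer score")
    return sum(layer_scores) / len(layer_scores)


def route_query(
    query: str,
    layer_scores: Sequence[float],
    answer_fn: Callable[[str], str],
    guarded_fn: Callable[[str], str],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> str:
    """Send the query down the normal or the guarded path."""
    harmfulness = aggregate_layer_scores(layer_scores)
    if harmfulness >= threshold:
        return guarded_fn(query)  # judged unsafe: guarded response
    return answer_fn(query)       # judged safe: normal processing


# Hypothetical stand-ins for the two response pathways.
def normal_answer(query: str) -> str:
    return f"ANSWER: {query}"


def guarded_answer(query: str) -> str:
    return "REFUSAL: this request cannot be answered."
```

With these stand-ins, `route_query("Explain photosynthesis", [0.1, 0.2, 0.3], normal_answer, guarded_answer)` takes the normal path because the mean score 0.2 is below the threshold, while scores such as `[0.9, 0.8, 0.7]` take the guarded path.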

Paper Structure

This paper contains 26 sections, 15 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: A unified defense objective against fine-tuning and jailbreak attacks to ensure consistent refusal of harmful queries. Fine-tuning attacks inject a small portion of harmful data into benign datasets, whereas jailbreak attacks rely on adversarial prompts (e.g., GCG) that attempt to bypass the model’s safety mechanisms.
  • Figure 2: The construction process of the educational safety evaluation dataset, EduHarm.
  • Figure 3: Hidden state tendencies of jailbreak strategies (ArtPrompt and PAIR) for harmful instructions at two token positions, $x_\text{inst}$ and $x_\text{post\_inst}$. Instructions are classified into successfully executed (accepted harmful) and rejected (refused harmful) cases. Hidden states are projected onto the refusal and acceptance clusters, where positive $s^l(h^l_{*})$ values indicate stronger alignment with refusal clusters, and negative values indicate stronger alignment with acceptance clusters, highlighting how hidden states differentiate successful jailbreaks from rejected instructions.
  • Figure 4: Hidden state tendencies of harmful instructions at two token positions ($x_\text{inst}$ and $x_\text{post\_inst}$) for models fine-tuned on either a normal dataset or a dataset containing 10% harmful data. Instructions are classified into successfully compliant responses (accepted harmful) and rejected responses (refused harmful), and their hidden states are analyzed relative to the refusal and acceptance clusters. Positive values of $s^l(h^l_{*})$ indicate a stronger alignment with refusal clusters, whereas negative values indicate a stronger alignment with acceptance clusters.
  • Figure 5: Overview of TSSF, a three-stage defense framework for LLMs that mitigates both jailbreak and fine-tuning attacks in educational applications through three steps: Safety-Aware Attention Realignment, Layer-Wise Safety Judgment, and Defense-Driven Dual Routing.
  • ...and 4 more figures
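
The sign convention used in the captions of Figures 3 and 4 (positive $s^l(h^l_{*})$ means closer to the refusal cluster, negative means closer to the acceptance cluster) can be made concrete with a small sketch. The exact definition of $s^l$ is not given in this summary, so the formula below, cosine similarity to the refusal centroid minus cosine similarity to the acceptance centroid, is an assumption chosen only because it reproduces that sign behavior. The function names and the toy 2-D vectors are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> list[float]:
    """Centroid of a cluster of hidden states from one layer."""
    if not vectors:
        raise ValueError("cluster must contain at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def tendency_score(
    h: Vector,
    refusal_states: Sequence[Vector],
    acceptance_states: Sequence[Vector],
) -> float:
    """Assumed score: > 0 leans toward refusal, < 0 toward acceptance."""
    refusal_centroid = mean_vector(refusal_states)
    acceptance_centroid = mean_vector(acceptance_states)
    return cosine(h, refusal_centroid) - cosine(h, acceptance_centroid)


# Toy clusters standing in for hidden states collected at one layer.
refusal_cluster = [[1.0, 0.0], [0.9, 0.1]]
acceptance_cluster = [[0.0, 1.0], [0.1, 0.9]]
```

Under this assumed score, a hidden state such as `[1.0, 0.1]` yields a positive value and `[0.1, 1.0]` a negative one, matching the reading of the figures: a successful jailbreak or a harmful fine-tune would show up as harmful instructions whose scores drift below zero.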