Table of Contents
Fetching ...

ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs

Ziyi Ni, Hao Wang, Huacan Wang

TL;DR

ShieldLearner introduces a human cognition-inspired, parameter-free defense against jailbreak attacks in LLMs. It distills attack signatures into a Pattern Atlas and defense heuristics into a Meta-analysis Framework, enabling explainable and reusable defenses, while Adaptive Adversarial Augmentation continually challenges defenses without retraining. Empirical results show ShieldLearner outperforms baselines on both standard and hard jailbreak benchmarks and reduces computational overhead. The approach promises faster adaptation to evolving threats and supports community standardization through explicit, auditable defenses. Future work will broaden dangerous-sample coverage and domain-specific generation paths for targeted deployment.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various domains but remain vulnerable to adversarial jailbreak attacks. Existing prompt-defense strategies, including parameter-modifying and parameter-free approaches, face limitations in adaptability, interpretability, and customization, constraining their effectiveness against evolving threats. To address these challenges, we propose ShieldLearner, a novel paradigm that mimics human learning in defense. Through trial and error, it autonomously distills attack signatures into a Pattern Atlas and synthesizes defense heuristics into a Meta-analysis Framework, enabling systematic and interpretable threat detection. Furthermore, we introduce Adaptive Adversarial Augmentation to generate adversarial variations of successfully defended prompts, enabling continuous self-improvement without model retraining. In addition to standard benchmarks, we create a hard test set by curating adversarial prompts from the Wildjailbreak dataset, emphasizing more concealed malicious intent. Experimental results show that ShieldLearner achieves a significantly higher defense success rate than existing baselines on both conventional and hard test sets, while also operating with lower computational overhead, making it a practical and efficient solution for real-world adversarial defense.

ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs

TL;DR

ShieldLearner introduces a human cognition-inspired, parameter-free defense against jailbreak attacks in LLMs. It distills attack signatures into a Pattern Atlas and defense heuristics into a Meta-analysis Framework, enabling explainable and reusable defenses, while Adaptive Adversarial Augmentation continually challenges defenses without retraining. Empirical results show ShieldLearner outperforms baselines on both standard and hard jailbreak benchmarks and reduces computational overhead. The approach promises faster adaptation to evolving threats and supports community standardization through explicit, auditable defenses. Future work will broaden dangerous-sample coverage and domain-specific generation paths for targeted deployment.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various domains but remain vulnerable to adversarial jailbreak attacks. Existing prompt-defense strategies, including parameter-modifying and parameter-free approaches, face limitations in adaptability, interpretability, and customization, constraining their effectiveness against evolving threats. To address these challenges, we propose ShieldLearner, a novel paradigm that mimics human learning in defense. Through trial and error, it autonomously distills attack signatures into a Pattern Atlas and synthesizes defense heuristics into a Meta-analysis Framework, enabling systematic and interpretable threat detection. Furthermore, we introduce Adaptive Adversarial Augmentation to generate adversarial variations of successfully defended prompts, enabling continuous self-improvement without model retraining. In addition to standard benchmarks, we create a hard test set by curating adversarial prompts from the Wildjailbreak dataset, emphasizing more concealed malicious intent. Experimental results show that ShieldLearner achieves a significantly higher defense success rate than existing baselines on both conventional and hard test sets, while also operating with lower computational overhead, making it a practical and efficient solution for real-world adversarial defense.

Paper Structure

This paper contains 29 sections, 7 figures, 5 tables, 2 algorithms.

Figures (7)

  • Figure 1: The overview of ShieldLearner. Our novel prompt-defense paradigm against jailbreak attacks. Our goal is to defend against harmful content concealed by different jailbreak attacks, which serve as jailbreak samples. During the self-learning phase, adversarial attacks continuously enhance these jailbreak samples to challenge the existing defense mechanism and create more difficult samples. We learn to recognize and extract patterns into the Pattern Atlas, while iteratively refining our defense analysis framework. These are then used in the testing phase.
  • Figure 2: Example of a Pattern signature.
  • Figure 3: Example of an analytical principle.
  • Figure 4: Illustration of the test phase.
  • Figure 5: Performance of ShieldLearner with varying numbers of training data in framework.
  • ...and 2 more figures