Table of Contents
Fetching ...

Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

Jun Leng, Litian Zhang, Xi Zhang

TL;DR

MAAG introduces an immunology-inspired, memory-enabled multi-agent framework for adaptive jailbreak detection in LLMs. By combining a memory-based immune detection stage with post-generation response simulation and a memory-update loop, MAAG rapidly recognizes known threats and generalizes to novel attacks without parameter tuning. Empirical results across five open-source models and six attack types demonstrate state-of-the-art performance, robustness to unseen prompts, and explainable decision-making via a four-stage workflow. The approach offers practical benefits for real-world LLM safety by enabling continual learning and reducing retraining costs in adversarial environments.

Abstract

Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. These attacks involve carefully crafted prompts that bypass safety guardrails and induce models to produce harmful content. Detecting such malicious input queries is therefore critical for maintaining LLM safety. Existing methods for jailbreak detection typically involve fine-tuning LLMs as static safety LLMs using fixed training datasets. However, these methods incur substantial computational costs when updating model parameters to improve robustness, especially in the face of novel jailbreak attacks. Inspired by immunological memory mechanisms, we propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. The core idea is to equip guard with memory capabilities: upon encountering novel jailbreak attacks, the system memorizes attack patterns, enabling it to rapidly and accurately identify similar threats in future encounters. Specifically, MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection. A defense agent then simulates responses based on these detection results, and an auxiliary agent supervises the simulation process to provide secondary filtering of the detection outcomes. Extensive experiments across five open-source models demonstrate that MAAG significantly outperforms state-of-the-art (SOTA) methods, achieving 98% detection accuracy and a 96% F1-score across a diverse range of attack scenarios.

Immunity memory-based jailbreak detection: multi-agent adaptive guard for large language models

TL;DR

MAAG introduces an immunology-inspired, memory-enabled multi-agent framework for adaptive jailbreak detection in LLMs. By combining a memory-based immune detection stage with post-generation response simulation and a memory-update loop, MAAG rapidly recognizes known threats and generalizes to novel attacks without parameter tuning. Empirical results across five open-source models and six attack types demonstrate state-of-the-art performance, robustness to unseen prompts, and explainable decision-making via a four-stage workflow. The approach offers practical benefits for real-world LLM safety by enabling continual learning and reducing retraining costs in adversarial environments.

Abstract

Large language models (LLMs) have become foundational in AI systems, yet they remain vulnerable to adversarial jailbreak attacks. These attacks involve carefully crafted prompts that bypass safety guardrails and induce models to produce harmful content. Detecting such malicious input queries is therefore critical for maintaining LLM safety. Existing methods for jailbreak detection typically involve fine-tuning LLMs as static safety LLMs using fixed training datasets. However, these methods incur substantial computational costs when updating model parameters to improve robustness, especially in the face of novel jailbreak attacks. Inspired by immunological memory mechanisms, we propose the Multi-Agent Adaptive Guard (MAAG) framework for jailbreak detection. The core idea is to equip guard with memory capabilities: upon encountering novel jailbreak attacks, the system memorizes attack patterns, enabling it to rapidly and accurately identify similar threats in future encounters. Specifically, MAAG first extracts activation values from input prompts and compares them to historical activations stored in a memory bank for quick preliminary detection. A defense agent then simulates responses based on these detection results, and an auxiliary agent supervises the simulation process to provide secondary filtering of the detection outcomes. Extensive experiments across five open-source models demonstrate that MAAG significantly outperforms state-of-the-art (SOTA) methods, achieving 98% detection accuracy and a 96% F1-score across a diverse range of attack scenarios.

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of existing methods and our method. Left is the description of existing methods which is static to detect evolving jailbreak attacks. Right is our method which adaptively detect jailbreak attacks.
  • Figure 2: The framework of MAAG operates in three stages. (1)Immune detection extracts hidden states from the LLM as attack states for the request, then retrieves and compares the top-K most similar benign and attack states from the memory bank. (2)Response simulation employs a simulation agent to generate candidate responses while a reflection agent supervises the process based on immune detection results and response content. (3)Memory update stores detection outcomes and attack states to adaptively refine the memory bank.
  • Figure 3: Adaptive study of MAAG.
  • Figure 4: Evaluation on different safety datasets.
  • Figure 5: Case study of MAAG.