MirrorShield: Towards Universal Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Rui Pu, Chaozhuo Li, Rui Ha, Litian Zhang, Lirong Qiu, Xi Zhang

TL;DR

MirrorShield is a dynamic, mirror-based universal defense against jailbreaks. It generates prompts that are syntactically aligned with the input but semantically safe (mirrors) and uses RIU, an attention-entropy-based discrepancy metric, to detect and mitigate harmful prompts. The architecture comprises a Mirror Generator (constrained instruction tuning), a Mirror Selector (constraint-based filtering), and an Entropy Defender that refines inputs through multi-query guidance driven by RIU. Empirical results across multiple open LLMs and attack vectors show substantial reductions in attack success rate (ASR) with only modest increases in computation, while preserving strong performance on benign tasks. The work advances practical jailbreak defenses by moving beyond static rules to a dynamic, comparative safety framework with potential applicability to real-world, diverse prompt streams.

Abstract

Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies typically rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules fail to accommodate the inherent complexity and dynamic nature of real-world jailbreak attacks. In this paper, we focus on the novel challenge of universal defense against diverse jailbreaks. We propose a new concept, the "mirror": a dynamically generated prompt that reflects the syntactic structure of the input while ensuring semantic safety. The discrepancies between input prompts and their corresponding mirrors serve as guiding principles for defense. A novel defense model, MirrorShield, is further proposed to detect and calibrate risky inputs based on the crafted mirrors. Evaluated on multiple benchmark datasets and tested against ten state-of-the-art attack methods, MirrorShield demonstrates superior defense performance and promising generalization capabilities.
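
To make the idea of comparing an input against its mirrors concrete, the following is a minimal sketch of an attention-entropy comparison. It is not the paper's RIU definition, which this summary does not give. The function names `attention_entropy` and `relative_discrepancy`, the use of Shannon entropy in nats, and the relative-gap formula are all assumptions chosen for illustration.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` is a list of non-negative attention weights over tokens;
    it is normalized here so it does not need to sum to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def relative_discrepancy(input_entropy, mirror_entropies):
    """Hypothetical discrepancy score (an assumption, not the paper's RIU):
    the gap between the input's entropy and the mean entropy of its
    mirrors, relative to that mean.
    """
    if not mirror_entropies:
        raise ValueError("at least one mirror entropy is required")
    mean_mirror = sum(mirror_entropies) / len(mirror_entropies)
    if mean_mirror == 0:
        raise ValueError("mean mirror entropy is zero; score is undefined")
    return abs(input_entropy - mean_mirror) / mean_mirror


if __name__ == "__main__":
    # Toy attention rows, made up for illustration only.
    input_row = [0.7, 0.1, 0.1, 0.1]
    mirror_rows = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.2, 0.2, 0.2]]
    h_in = attention_entropy(input_row)
    h_mirrors = [attention_entropy(r) for r in mirror_rows]
    print(relative_discrepancy(h_in, h_mirrors))
```

Under these assumptions, a score near zero means the input attends much like its safe mirrors, while a larger score flags the input for further handling. Where the threshold sits is a hyperparameter this sketch does not choose.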

Paper Structure

This paper contains 31 sections, 11 equations, 5 figures, 5 tables, and 1 algorithm.

Figures (5)

  • Figure 1: The differences between static discrimination-based defense and our proposed dynamic method.
  • Figure 2: The overview of the proposed MirrorShield model, including the mirror generator, the mirror selector, and the entropy defender via mirror comparison.
  • Figure 3: Comparison of attention entropy with jailbreak attack prompts and harmless prompts.
  • Figure 4: The comparison of RIU under different attack methods across four LLMs.
  • Figure 5: Hyperparameter sensitivity analysis.