Table of Contents
Fetching ...

Intention Analysis Makes LLMs A Good Jailbreak Defender

Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao

TL;DR

Facing rising jailbreak risks in LLMs, this paper introduces Intention Analysis ($\mathbb{IA}$), an inference-only two-stage defense that first infers the user's essential safety/ethics/legality intent and then generates a policy-aligned response. IA significantly reduces harmful outputs across diverse jailbreak methods and model families while preserving general helpfulness, achieving substantial attack-success-rate reductions without requiring safety training. The mechanism shifts attention away from jailbreak prompts toward user intent, with robustness to imperfect intention signals and a cost-efficient one-pass variant. The work demonstrates IA's strong practical potential as a plug-and-play safety enhancement and outlines avenues for future improvement and integration with broader alignment efforts.

Abstract

Aligning large language models (LLMs) with human values, particularly when facing complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook this intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ works by triggering LLMs' inherent self-correct and improve ability through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing final policy-aligned responses based on the first round conversation. Notably, $\mathbb{IA}$ is an inference-only method, thus could enhance LLM safety without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ could consistently and significantly reduce the harmfulness in responses (averagely -48.2% attack success rate). Encouragingly, with our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 regarding attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in generated intentions. Further analyses reveal the underlying principle of $\mathbb{IA}$: suppressing LLM's tendency to follow jailbreak prompts, thereby enhancing safety.

Intention Analysis Makes LLMs A Good Jailbreak Defender

TL;DR

Facing rising jailbreak risks in LLMs, this paper introduces Intention Analysis (), an inference-only two-stage defense that first infers the user's essential safety/ethics/legality intent and then generates a policy-aligned response. IA significantly reduces harmful outputs across diverse jailbreak methods and model families while preserving general helpfulness, achieving substantial attack-success-rate reductions without requiring safety training. The mechanism shifts attention away from jailbreak prompts toward user intent, with robustness to imperfect intention signals and a cost-efficient one-pass variant. The work demonstrates IA's strong practical potential as a plug-and-play safety enhancement and outlines avenues for future improvement and integration with broader alignment efforts.

Abstract

Aligning large language models (LLMs) with human values, particularly when facing complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook this intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis (). works by triggering LLMs' inherent self-correct and improve ability through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing final policy-aligned responses based on the first round conversation. Notably, is an inference-only method, thus could enhance LLM safety without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across a wide range of LLMs show that could consistently and significantly reduce the harmfulness in responses (averagely -48.2% attack success rate). Encouragingly, with our , Vicuna-7B even outperforms GPT-3.5 regarding attack success rate. We empirically demonstrate that, to some extent, is robust to errors in generated intentions. Further analyses reveal the underlying principle of : suppressing LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
Paper Structure (59 sections, 3 equations, 18 figures, 12 tables)

This paper contains 59 sections, 3 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Performance of our method on different LLMs. Our $\mathbb{IA}$ 1) reduces Attack Success Rate ($\downarrow$) against both crafted jailbreak prompts (DAN and DeepInception) and automatic attack (GCG), 2) achieves remarkable safety improvements for both SFT (Vicuna-7B & MPT-30B-Chat) and RLHF (GPT-3.5) LLMs.
  • Figure 2: Illustrated Comparison of (a) vanilla and (b) the proposed $\mathbb{IA}$. Our $\mathbb{IA}$ consists of two stages: (1) Essential Intention Analysis: instructing the language model to analyse the intention of the user query with an emphasis on safety, ethics, and legality; (2) Policy-Aligned Response: eliciting the final response aligned with safety policy, building upon the analyzed intention from the first stage.
  • Figure 3: The confusion matrix illustrating the relationship between the success of intention analysis and the harmlessness of LLM's final response on SAP200 and DAN datasets. "IR Succ." and "IR Fail." represent success or failure of intention analysis, respectively.
  • Figure 4: Performance of $\mathbb{IA}$ with varying correct intention ratio on DAN dataset. From left to right: the correct intentions are replaced with masked and random intention, respectively.
  • Figure 5: Comparison of Vicuna-13B’s attention scores on different prompt components of different methods. The average attention score is computed on DAN dataset. $\mathbb{IA}$ largely decreases model’s attention to jailbreak prompt (red bar) in both two stages.
  • ...and 13 more figures