Intention Analysis Makes LLMs A Good Jailbreak Defender
Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao
TL;DR
Facing rising jailbreak risks in LLMs, this paper introduces Intention Analysis ($\mathbb{IA}$), an inference-only two-stage defense that first infers the user's essential safety/ethics/legality intent and then generates a policy-aligned response. IA significantly reduces harmful outputs across diverse jailbreak methods and model families while preserving general helpfulness, achieving substantial attack-success-rate reductions without requiring safety training. The mechanism shifts attention away from jailbreak prompts toward user intent, with robustness to imperfect intention signals and a cost-efficient one-pass variant. The work demonstrates IA's strong practical potential as a plug-and-play safety enhancement and outlines avenues for future improvement and integration with broader alignment efforts.
Abstract
Aligning large language models (LLMs) with human values, particularly when facing complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook this intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ works by triggering LLMs' inherent self-correct and improve ability through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing final policy-aligned responses based on the first round conversation. Notably, $\mathbb{IA}$ is an inference-only method, thus could enhance LLM safety without compromising their helpfulness. Extensive experiments on varying jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ could consistently and significantly reduce the harmfulness in responses (averagely -48.2% attack success rate). Encouragingly, with our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 regarding attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in generated intentions. Further analyses reveal the underlying principle of $\mathbb{IA}$: suppressing LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
