Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

Shiyu Xiang, Ansen Zhang, Yanfei Cao, Yang Fan, Ronghao Chen

TL;DR

EDDF introduces an essence-driven defense against jailbreak attacks in LLMs by shifting from surface-pattern detection to offline extraction of attack essences and online retrieval-based filtering. The framework constructs an offline Essence Vector Database by parsing attack strategies into concise essences and validating them, then detects adversarial queries online via query abstraction, cosine-based retrieval, and a fine-grained, few-shot judgment. Empirical results show EDDF significantly reduces attack success rates (at least 20% relative) and maintains low false positives on benign inputs, with demonstrated compatibility across model scales and initial multimodal extensions. Limitations include reliance on a growing offline essence corpus and the need for real-time updates and broader model validation.

Abstract

Although Aligned Large Language Models (LLMs) are trained to refuse harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying "attack essence" remains the same. To address this issue, we introduce EDDF, an Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the "attack essence" from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.
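
The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the hashed bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `EssenceEntry`, `build_essence_db`, and `retrieve` are hypothetical. The sketch also assumes the user query has already been abstracted into an essence string (the paper does this with a prompted LLM), and it stops at retrieval; the fine-grained, few-shot judgment step is not modeled.

```python
import math
import zlib
from dataclasses import dataclass

DIM = 64  # assumed toy embedding size, not a value from the paper


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class EssenceEntry:
    essence: str          # concise description of an attack strategy
    vector: list[float]   # its embedding


def build_essence_db(essences: list[str]) -> list[EssenceEntry]:
    """Offline stage (sketch): embed each known attack essence once."""
    return [EssenceEntry(e, embed(e)) for e in essences]


def retrieve(db: list[EssenceEntry], query_essence: str, top_k: int = 2):
    """Online stage (sketch): rank stored essences by cosine similarity."""
    q = embed(query_essence)
    scored = [(cosine(q, entry.vector), entry.essence) for entry in db]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Illustrative essences, invented for this example only.
db = build_essence_db([
    "role play as an unrestricted persona to bypass safety rules",
    "hide the harmful request inside an encoded or translated payload",
])
matches = retrieve(db, "role play as an unrestricted persona to bypass safety rules")
```

In a fuller pipeline, the retrieved essences would presumably be handed to an LLM judge as few-shot context rather than thresholded directly; that step is left out here because the summary above does not specify it in enough detail.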

Paper Structure

This paper contains 30 sections, 6 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Comparison of three defense methods under Original Dataset and Jailbreak Proliferation (Aligned Model, Other Defences, and EDDF). (Left) In the original dataset, the aligned model (e.g., GPT-4) fails to defend, while other defenses and EDDF succeed. (Right) In the Jailbreak Proliferation dataset, where the attack surface pattern shifts significantly while the attack essence remains similar, the aligned model and other defenses both fail, but EDDF successfully defends.
  • Figure 2: Overview of EDDF. (Top) Offline Essence Database Construction: we extract the underlying "attack essence" from a diverse set of known attack instances and store these essence representations in an offline vector database. (Bottom) Online Adversarial Query Detection: When a new user query is received, the framework identifies and defends against attacks through user query abstraction, essence vector retrieval, and Fine-Grained Judgment.
  • Figure 3: Overview of Online Adversarial Query Detection: When a user query is received, our pipeline runs the complete defense mechanism process, including intermediate outputs.
  • Figure 4: Comparison of our EDDF and seven baselines under eight jailbreak methods in terms of ASR (%) and FPR (%), with Qwen-Plus as the target model.
  • Figure 5: Prompt for User Query Abstraction in Our Essence-Aware Framework
  • ...and 3 more figures