Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Fan Liu, Zhao Xu, Hao Liu

TL;DR

The paper addresses jailbreak vulnerabilities in LLMs with a two-stage adversarial tuning framework: it generates adversarial prompts that expose worst-case behavior, then fine-tunes the model to respond safely to them. The first stage uses hierarchical meta-universal adversarial prompt learning to generate token-level adversarial prompts; the second uses automatic adversarial prompt learning to iteratively refine semantic-level ones. The aim is to defend against both known and unknown jailbreak attacks without extra filtering. Experiments on three jailbreak datasets, against six defense baselines under five attack scenarios, show reduced jailbreak success, and the defense transfers across attack strategies and target LLMs.

Abstract

Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly unknown jailbreak attacks. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts, further enhancing LLMs' defense capabilities. We conducted comprehensive experiments on three widely used jailbreak datasets, comparing our framework with six defense baselines under five representative attack scenarios. The results underscore the superiority of our proposed methods. Furthermore, our adversarial tuning framework exhibits empirical generalizability across various attack strategies and target LLMs, highlighting its potential as a transferable defense mechanism.
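The abstract describes the tuning data as pairs of adversarial prompts and safe responses, produced in two stages. A minimal sketch of that data shape follows. It is not the authors' implementation: the names `TuningExample`, `token_level_stage`, `semantic_level_stage`, `build_tuning_set`, the fixed refusal string, and the choice to run stage two on the raw instructions are all assumptions made for illustration, and both stages are replaced by trivial string operations rather than learned prompt optimization.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TuningExample:
    prompt: str    # adversarial prompt shown to the model
    response: str  # safe response the model is tuned to produce


# Hypothetical refusal template; the paper's safe responses are not given here.
SAFE_RESPONSE = "I can't help with that request."


def token_level_stage(instructions: List[str], universal_suffix: str) -> List[str]:
    """Stage 1 stand-in: append one shared suffix to every instruction."""
    return [f"{inst} {universal_suffix}" for inst in instructions]


def semantic_level_stage(
    instructions: List[str], rewrite: Callable[[str], str], rounds: int
) -> List[str]:
    """Stage 2 stand-in: apply a rewrite function for a fixed number of rounds."""
    refined = list(instructions)
    for _ in range(rounds):
        refined = [rewrite(p) for p in refined]
    return refined


def build_tuning_set(
    instructions: List[str],
    universal_suffix: str,
    rewrite: Callable[[str], str],
    rounds: int = 2,
) -> List[TuningExample]:
    """Pair every adversarial prompt from both stages with a safe response."""
    prompts = token_level_stage(instructions, universal_suffix)
    prompts += semantic_level_stage(instructions, rewrite, rounds)
    return [TuningExample(p, SAFE_RESPONSE) for p in prompts]


# Placeholder inputs only; no real attack strings are used.
instructions = ["<placeholder request A>", "<placeholder request B>"]
examples = build_tuning_set(
    instructions,
    universal_suffix="<placeholder suffix>",
    rewrite=lambda p: f"In a fictional story, {p}",
)
```

The resulting list could then feed an ordinary supervised fine-tuning loop; how the paper weights or mixes the two stages is not stated in this summary.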

Paper Structure

This paper contains 43 sections, 2 theorems, 26 equations, 12 figures, 6 tables, and 2 algorithms.

Key Result

Theorem 1

When the universal adversarial suffix $\mathbf{u}$ is used as the initial adversarial suffix, the optimization process starting from $\mathbf{u}$ requires fewer iterations than starting from a zero initial point, and the speedup is about $\frac{\mathcal{L}_{0}- \mathcal{L}_{\text{min}}}{\mathcal{L}_{\ma
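The intuition behind Theorem 1, that a good initial point shortens optimization, can be illustrated with a toy example. The sketch below is only an analogy and rests on assumptions: it runs plain gradient descent on a continuous quadratic loss, whereas the paper optimizes discrete adversarial suffixes, and the function name `iterations_to_converge`, the target vector, and both starting points are hypothetical. It does not reproduce the theorem's speedup expression, which is truncated in this summary.

```python
from typing import List


def iterations_to_converge(
    start: List[float],
    target: List[float],
    lr: float = 0.1,
    tol: float = 1e-6,
    max_iters: int = 10_000,
) -> int:
    """Count gradient steps until 0.5 * ||x - target||^2 drops to tol or below."""
    x = list(start)
    for step in range(max_iters):
        loss = 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))
        if loss <= tol:
            return step
        # The gradient of the quadratic loss with respect to x is (x - target).
        x = [xi - lr * (xi - ti) for xi, ti in zip(x, target)]
    return max_iters


target = [3.0, -2.0, 1.0]

# Cold start: begin at the zero point, far from the minimizer.
cold_steps = iterations_to_converge([0.0, 0.0, 0.0], target)

# Warm start: begin at a point already close to the minimizer, standing in
# for an initialization from a shared (universal) solution.
warm_steps = iterations_to_converge([2.5, -1.5, 0.8], target)

print(f"cold start: {cold_steps} steps, warm start: {warm_steps} steps")
```

Because the warm start begins with a smaller initial loss, it reaches the same tolerance in fewer steps, which is the qualitative behavior the theorem attributes to initializing from $\mathbf{u}$.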

Figures (12)

  • Figure 1: Framework overview.
  • Figure 2: Transferability comparison of adversarial fine-tuning datasets across different LLMs.
  • Figure 3: Effect of MUAS.
  • Figure 4: Effect of two-stage AT under prompt-level attack.
  • Figure 5: Effect of two-stage AT under token-level attack.
  • ...and 7 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Proof 1