Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

Fan Liu; Zhao Xu; Hao Liu

TL;DR

The paper addresses jailbreak vulnerabilities in LLMs with a two-stage adversarial tuning framework: it generates adversarial prompts that expose worst-case behavior, then fine-tunes the model to respond safely to them. The first stage uses hierarchical meta-universal learning to produce token-level adversarial prompts; the second automatically refines semantic-level prompts. The goal is to defend against both known and unknown jailbreak attacks without extra filtering. Experiments on three jailbreak datasets, against six defense baselines and under five attack scenarios, show significant reductions in jailbreak success, and the defense transfers across attack strategies and target LLMs.

Abstract

Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly unknown jailbreak attacks. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts, further enhancing the LLM's defense capabilities. We conducted comprehensive experiments on three widely used jailbreak datasets, comparing our framework with six defense baselines under five representative attack scenarios. The results underscore the superiority of our proposed methods. Furthermore, our adversarial tuning framework exhibits empirical generalizability across various attack strategies and target LLMs, highlighting its potential as a transferable defense mechanism.
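The abstract describes the fine-tuning data as pairs of adversarial prompts and safe responses, produced at the token level and the semantic level. A minimal sketch of how such a dataset might be assembled is given below. Every name in it (`TuningExample`, `build_tuning_set`, the three toy callables) is hypothetical, and how the two stages' data are combined is an assumption made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TuningExample:
    prompt: str    # adversarial prompt shown to the model during fine-tuning
    response: str  # safe response the model should learn to give


def build_tuning_set(
    instructions: List[str],
    token_level_attack: Callable[[str], str],
    prompt_level_refine: Callable[[str], str],
    safe_response: Callable[[str], str],
) -> List[TuningExample]:
    """Pair each adversarial variant of an instruction with a safe response."""
    examples: List[TuningExample] = []
    for instruction in instructions:
        target = safe_response(instruction)
        # Stage 1 (assumed interface): token-level adversarial prompt.
        examples.append(TuningExample(token_level_attack(instruction), target))
        # Stage 2 (assumed interface): semantic-level refined prompt.
        examples.append(TuningExample(prompt_level_refine(instruction), target))
    return examples


# Toy stand-ins so the sketch runs; real attacks would be optimization procedures.
def toy_suffix(instruction: str) -> str:
    return instruction + " <adv-suffix>"


def toy_rewrite(instruction: str) -> str:
    return "<rewritten> " + instruction


def toy_refusal(instruction: str) -> str:
    return "I can't help with that."


dataset = build_tuning_set(
    ["<instruction-1>", "<instruction-2>"], toy_suffix, toy_rewrite, toy_refusal
)
```

Passing the attack procedures in as callables keeps the data-assembly step independent of how the adversarial prompts are actually generated.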

Paper Structure

This paper contains 43 sections, 2 theorems, 26 equations, 12 figures, 6 tables, and 2 algorithms.

Introduction
Preliminary
Threat Model
Problem Statement
Methodology
Hierarchical Meta-Universal Adversarial Tuning
Outer Universal Adversarial Prompt Learning
Inner Individual Adversarial Prompt Learning
Token-level Adversarial Tuning Optimization
Prompt-Level Adversarial Refinement Learning
Theoretical Analysis
Experiments
Experiments Setup
Main Experiments
Transferability of Adversarial Fine-tuning Data
...and 28 more sections

Key Result

Theorem 1

When the universal adversarial suffix $\mathbf{u}$ is used as the initial adversarial suffix, optimization starting from $\mathbf{u}$ requires fewer iterations than optimization starting from the zero initial point, with a speedup of about $\frac{\mathcal{L}_{0}- \mathcal{L}_{\text{min}}}{\mathcal{L}_{\ma
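The intuition behind the warm-start claim in Theorem 1 can be illustrated with a toy example. The sketch below swaps the paper's discrete suffix optimization for plain gradient descent on a one-dimensional quadratic loss, which is an assumption made purely for illustration; the function name and all constants are hypothetical. It counts the steps needed to reach a loss tolerance from a zero start and from a start that already lies near the minimum.

```python
def iterations_to_converge(
    start: float,
    target: float = 3.0,
    lr: float = 0.1,
    tol: float = 1e-3,
    max_iters: int = 10_000,
) -> int:
    """Count gradient-descent steps until the loss (x - target)^2 is <= tol."""
    x = start
    for step in range(max_iters):
        loss = (x - target) ** 2
        if loss <= tol:
            return step
        # Gradient of (x - target)^2 with respect to x is 2 * (x - target).
        x -= lr * 2.0 * (x - target)
    return max_iters


cold = iterations_to_converge(0.0)  # start from the zero point
warm = iterations_to_converge(2.5)  # start from a point with lower initial loss
print(f"cold start: {cold} steps, warm start: {warm} steps")
```

In this toy setting the warm start converges in fewer steps because its initial loss is already closer to the minimum, which mirrors the role the universal suffix plays as an initialization.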

Figures (12)

Figure 1: Framework overview.
Figure 2: Transferability comparison of adversarial fine-tuning datasets across different LLMs.
Figure 3: Effect of MUAS.
Figure 4: Effect of two-stage AT under prompt-level attack.
Figure 5: Effect of two-stage AT under token-level attack.
...and 7 more figures

Theorems & Definitions (3)

Theorem 1
Theorem 2
Proof 1
