Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
Fan Liu, Zhao Xu, Hao Liu
TL;DR
The paper tackles jailbreak vulnerabilities in LLMs with a two-stage adversarial tuning framework: it generates adversarial prompts that expose worst-case behavior, then fine-tunes the model to respond safely to them. The first stage uses hierarchical meta-universal learning to produce token-level adversarial prompts; the second uses automatic prompt refinement to produce semantic-level ones. Together they aim to defend against both known and unknown jailbreak attacks without extra filtering. Experiments on three jailbreak datasets, against six defense baselines under five attack scenarios, show lower jailbreak success rates, and the defense transfers across attack strategies and target model families.
Abstract
Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly unknown jailbreak attacks. To enhance LLMs' generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts, further enhancing the LLM's defense capabilities. We conducted comprehensive experiments on three widely used jailbreak datasets, comparing our framework with six defense baselines under five representative attack scenarios. The results underscore the superiority of our proposed methods. Furthermore, our adversarial tuning framework exhibits empirical generalizability across various attack strategies and target LLMs, highlighting its potential as a transferable defense mechanism.
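The idea of exploring worst-case scenarios and then tuning on adversarial prompts paired with safe responses can be read as a min-max problem. The formula below is a generic adversarial-training sketch, not the paper's stated objective, and all of its notation is assumed here for illustration: x is a harmful query, A(x) is the set of adversarial prompts derived from x (token-level or semantic-level), y_safe is a safe reference response, D is the tuning dataset, and p_theta is the target LLM with parameters theta.

```latex
% Illustrative min-max view of adversarial tuning.
% Notation is assumed for this sketch and is not taken from the paper.
% Inner max: pick the adversarial prompt a that makes the safe response least likely.
% Outer min: fine-tune \theta so the safe response stays likely even for that prompt.
\[
\min_{\theta} \;
\mathbb{E}_{(x,\, y_{\mathrm{safe}}) \sim \mathcal{D}}
\left[
  \max_{a \in \mathcal{A}(x)}
  \; -\log p_{\theta}\!\left( y_{\mathrm{safe}} \mid a \right)
\right]
\]
```

Under this reading, both stages described in the abstract approximate the inner maximization (first over token-level prompts, then over semantic-level prompts), and fine-tuning on the resulting prompt and safe-response pairs performs the outer minimization. The paper's actual attack and tuning losses may differ from this simplified form.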
