KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs
Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, René Vidal
TL;DR
KDA addresses the scalability and diversity limits of existing jailbreak methods by distilling the strategies of multiple SOTA attackers into a single open-source attacker. It finetunes Vicuna-13B with LoRA on a labeled dataset of attack prompts generated by AutoDAN, PAIR, and GPTFuzzer across four attack formats, conditioned on harmful queries. The resulting KDA achieves high attack success rates (ASR) across open-source and commercial LLMs (e.g., ASR up to 100% on Vicuna, Qwen, and Mistral; 88.5% on Llama-2-7B-Chat; 83.5% on Llama-2-13B-Chat) and generalizes to unseen datasets, while its ensemble-format approach improves both efficiency and prompt diversity. The authors release the KDA model and its training data to support reproducibility and defense development; ablations identify format ensembling and topic diversity as the key drivers.
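The ASR figures above are percentages of harmful queries for which an attack succeeded. As a minimal sketch of how such a number is computed, assume a query counts as jailbroken if at least one attempt against it is judged successful; the summary does not spell out the exact criterion, so this definition, the function name `attack_success_rate`, and the verdicts below are all hypothetical.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[Sequence[bool]]) -> float:
    """Return the percentage of queries with at least one successful attempt.

    Assumed definition: a query is counted once, no matter how many
    attempts were made against it.
    """
    if not outcomes:
        raise ValueError("need at least one query")
    successes = sum(1 for attempts in outcomes if any(attempts))
    return 100.0 * successes / len(outcomes)


# Hypothetical judge verdicts (not data from the paper):
# one inner list per query, one boolean per attempt.
example = [
    [False, True],
    [False, False],
    [True],
    [False, False, True],
]

print(f"ASR = {attack_success_rate(example):.1f}%")  # 3 of 4 queries -> 75.0%
```

Counting per query rather than per attempt keeps the metric separate from the query budget, which is why efficiency is reported alongside ASR rather than folded into it.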
Abstract
Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost and time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conduct a quantitative diversity analysis of the prompts generated by baseline methods and by KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.
