KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs

Buyun Liang, Kwan Ho Ryan Chan, Darshan Thaker, Jinqi Luo, René Vidal

TL;DR

KDA addresses scalability and diversity limits of jailbreak research by distilling strategies from multiple SOTA attackers into a single open-source attacker. It finetunes Vicuna-13B using LoRA on a labeled dataset generated from AutoDAN, PAIR, and GPTFuzzer across four attack formats, conditioned on harmful queries. The resulting KDA achieves high attack success rates (ASR) across open-source and commercial LLMs (e.g., ASR up to $100\%$ on Vicuna, Qwen, and Mistral; $88.5\%$ on Llama-2-7B-Chat; $83.5\%$ on Llama-2-13B-Chat) and generalizes to unseen datasets, while increasing efficiency and diversity via an ensemble-format approach. The authors release both the KDA model and its training data to promote reproducibility and defense development, with ablations showing format ensembling and topic diversity as key drivers.

Abstract

Jailbreak attacks exploit specific prompts to bypass LLM safeguards, causing the LLM to generate harmful, inappropriate, and misaligned content. Current jailbreaking methods rely heavily on carefully designed system prompts and numerous queries to achieve a single successful attack, which is costly and impractical for large-scale red-teaming. To address this challenge, we propose to distill the knowledge of an ensemble of SOTA attackers into a single open-source model, called Knowledge-Distilled Attacker (KDA), which is finetuned to automatically generate coherent and diverse attack prompts without the need for meticulous system prompt engineering. Compared to existing attackers, KDA achieves higher attack success rates and greater cost-time efficiency when targeting multiple SOTA open-source and commercial black-box LLMs. Furthermore, we conducted a quantitative diversity analysis of prompts generated by baseline methods and KDA, identifying diverse and ensemble attacks as key factors behind KDA's effectiveness and efficiency.

Paper Structure

This paper contains 28 sections, 5 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: KDA Attack Generation: Overview and Example Prompts. (Top) Schematic overview of the KDA attack generation process. The formats A, P, G, and M correspond to prompts learned from the teacher attackers: AutoDAN, PAIR, GPTFuzzer, and Mixed, respectively. (Bottom) Examples of attack prompts generated by KDA, conditioned on different formats.
  • Figure 2: Schematic overview of KDA training.
  • Figure 3: Comparison of ASR vs. target query budget for KDA and SOTA attack methods. The evaluation is conducted on the Harmful-Behavior dataset (chao_jailbreaking_2024). The ASR values in this plot are evaluated using the HB evaluator. Our KDA method employs the format selection strategy $\texttt{trn}$. The curves represent average ASR across different LLM targets, computed over 10,000 bootstrap samples, with shaded regions indicating standard deviations.
  • Figure 4: Ablation study on attack success rate for all KDA format selection strategies. The curves depict $\text{ASR}^{\text{HB}}_{30}$, the attack success rate with a target query budget of $M=30$ using the HB evaluator, comparing single-format settings ($F \in \{\texttt{A}, \texttt{P},\texttt{G}, \texttt{M}\}$) and ensemble-format settings ($\texttt{uni}, \texttt{ifr}, \texttt{trn}$). The evaluation is conducted on the standard behavior dataset from HarmBench (mazeika_harmbench_2024). Solid lines represent ensemble formats while dashed lines indicate single formats. Uncertainty is quantified using the standard deviation from 10,000 bootstrap samples drawn with replacement.
  • Figure 5: ASR and Topic Diversity of KDA in the single-format setting, $\text{KDA}_F$ with format $F \in \{\texttt{A}, \texttt{P},\texttt{G}, \texttt{M}\}$, compared to SOTA baselines AutoDAN, PAIR, and GPTFuzzer. (Top) Attack Success Rate (ASR); (Bottom) Topic Diversity Ratio (TDR). $\text{KDA}_\text{A}$, $\text{KDA}_\text{P}$, and $\text{KDA}_\text{G}$ share the same color scheme as their respective baseline counterparts AutoDAN, PAIR, and GPTFuzzer. The evaluation is conducted on the Harmful-Behavior dataset (chao_jailbreaking_2024). Uncertainty is quantified using the standard deviation from 10,000 bootstrap samples drawn with replacement.
  • ...and 4 more figures
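
Several captions above report ASR at a target query budget $M$ with uncertainty given as the standard deviation over 10,000 bootstrap samples drawn with replacement. The sketch below shows one way such a number could be computed. It is only an illustration under stated assumptions: the function names, the toy data, and the choice to resample over harmful queries are hypothetical and are not taken from the paper, which may define the resampling unit differently.

```python
import random
import statistics


def asr_at_budget(first_success, budget):
    """Fraction of queries jailbroken within `budget` target queries.

    first_success[i] is the 1-based index of the first successful attack
    attempt for harmful query i, or None if no attempt succeeded.
    (This per-query encoding is an assumption made for illustration.)
    """
    hits = [s is not None and s <= budget for s in first_success]
    return sum(hits) / len(hits)


def bootstrap_asr(first_success, budget, n_boot=10_000, seed=0):
    """Bootstrap mean and standard deviation of ASR at a given budget.

    Resamples the harmful queries with replacement (assumed resampling unit).
    """
    rng = random.Random(seed)
    n = len(first_success)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(first_success, k=n)  # draw n queries with replacement
        estimates.append(asr_at_budget(sample, budget))
    return statistics.mean(estimates), statistics.pstdev(estimates)


# Toy outcomes for 10 hypothetical harmful queries (not real results).
first_success = [1, 3, None, 12, 30, None, 2, 45, 7, 1]

point = asr_at_budget(first_success, budget=30)
boot_mean, boot_std = bootstrap_asr(first_success, budget=30)
print(f"ASR_30 = {point:.2f}, bootstrap mean = {boot_mean:.3f}, std = {boot_std:.3f}")
```

Sweeping `budget` from 1 to the maximum number of attempts and plotting the bootstrap mean with a band of one standard deviation would give a curve of the kind described for Figure 3.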