Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li, Xiaobo Jin

TL;DR

This work tackles the practical limitations of large-language-model jailbreaks by proposing Adversarial Prompt Distillation (APD), which distills LLM jailbreak capabilities into small language models. APD combines LoRA-style fine-tuning with KL-based distribution alignment, a dynamic temperature mechanism for prompt sampling, and reinforcement learning from AI feedback to optimize template selection and prompts. Empirical results show that distilled SLMs achieve high attack success across multiple victim models while reducing generation time and resource usage, demonstrating strong transferability and efficiency gains. The study highlights both the security implications for LLMs and the need for robust defenses, while offering a scalable framework for future jailbreak investigations and defense research.

Abstract

As the scale and complexity of jailbreaking attacks on large language models (LLMs) continue to escalate, their efficiency and practical applicability are constrained, posing a profound challenge to LLM security. Jailbreaking techniques have advanced from manual prompt engineering to automated methodologies. Recent automated approaches harness LLMs to generate jailbreak instructions and adversarial examples, delivering encouraging results. Nevertheless, these methods universally include an LLM generation phase, which, due to the complexity of deploying LLMs and running inference with them, impedes effective implementation and broader adoption. To mitigate this issue, we introduce Adversarial Prompt Distillation, a framework that integrates masked language modeling, reinforcement learning, and dynamic temperature control to distill LLM jailbreaking capabilities into smaller language models (SLMs). This methodology enables efficient, robust jailbreak attacks while maintaining high success rates and accommodating a broader range of application contexts. Empirical evaluations affirm the approach's superiority in attack efficacy, resource optimization, and cross-model versatility. Our research underscores the practicality of transferring jailbreak capabilities to SLMs, reveals inherent vulnerabilities in LLMs, and provides novel insights to advance LLM security investigations. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.

Paper Structure

This paper contains 37 sections, 7 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Measuring the complexity of mainstream generative jailbreaks. The left panel shows time complexity, and the right panel shows space complexity.
  • Figure 2: The structure of the APD framework: (a) the pre-training phase; (b) the model distillation and reinforcement optimization attack stage. The framework is a multi-stage knowledge distillation method that aims to transfer the adversarial generation knowledge and ability of Llama, the teacher model, to BERT, a lightweight student model.
  • Figure 2: Comparison of time with other methods. APD demonstrates a clear advantage in per-sample time.
  • Figure 3: The trends of ASR$_k$ and ASR$_l$ for Llama-2-7b
  • Figure 4: The trends of ASR$_k$ and ASR$_l$ for Llama-2-13b
  • ...and 3 more figures