Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao

TL;DR

ADV-LLM introduces an iterative self-tuning framework that converts a pretrained LLM into a generator of adversarial suffixes to jailbreak victim models. It reduces reliance on costly data collection and search by training the suffix generator on self-generated data, achieving near-perfect ASR on open-source LLMs and strong transfer to GPT-3.5 and GPT-4. The method combines initial suffix/target design with a two-phase training loop and a decreasing temperature to focus the search, and it demonstrates notable generalization to unseen queries and resilience to perplexity-based defenses. The work highlights safety vulnerabilities in current LLMs and offers a scalable dataset generation approach to study and improve safety alignment.

Abstract

Recent research has shown that Large Language Models (LLMs) are vulnerable to automated jailbreak attacks, where adversarial suffixes crafted by algorithms appended to harmful queries bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety. Our code is available at: https://github.com/SunChungEn/ADV-LLM
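
The ASR figures quoted in the abstract are ratios of successful attacks to attempted queries. The following is a minimal sketch of that arithmetic only, assuming each query has already been judged by some external checker; the function name `attack_success_rate` and the boolean encoding of verdicts are illustrative assumptions, not taken from the paper or its repository.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of queries judged as successful attacks.

    `outcomes` is a hypothetical per-query list of judge verdicts
    (True = judged as a successful attack, False = judged as a refusal).
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one verdict")
    # True counts as 1 and False as 0, so the sum is the success count.
    return sum(outcomes) / len(outcomes)


# Illustrative verdicts only, not data from the paper.
verdicts = [True, True, False, True]
print(f"ASR = {attack_success_rate(verdicts):.0%}")
```

Because the metric is a plain ratio, reported values depend entirely on which judge produces the verdicts, which is why the figure captions name the checker used.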

Paper Structure

This paper contains 37 sections, 2 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: The overview of crafting ADV-LLM. The process begins with refining the target and initializing a starting suffix. ADV-LLM then iteratively generates data for self-tuning.
  • Figure 2: The ASR (LlamaGuard check) with respect to iteration. ADV-LLMs become more powerful as the number of iterations increases, especially against more robust victims like Llama2 and Llama3.
  • Figure 3: Example of jailbreaking GPT4-Turbo (2024-04-09). The suffix is generated by ADV-LLM optimized on Llama3.
  • Figure 4: Example of jailbreaking GPT4-Turbo (2024-04-09). The suffix is generated by ADV-LLM optimized on Llama3.
  • Figure 5: Example of jailbreaking GPT4-Turbo (2024-04-09). The suffix is generated by ADV-LLM optimized on Llama2.