AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao

TL;DR

The paper tackles the vulnerability of aligned LLMs to jailbreak prompts by introducing AutoDAN-Turbo, a black-box, lifelong-learning framework that autonomously discovers, evolves, and retrieves jailbreak strategies. It integrates three modules (attack generation with scoring, a strategy library built from attack logs, and a retrieval mechanism) that together enable continual improvement and reuse of strategies, while remaining compatible with human-designed prompts. Extensive evaluations on HarmBench and across models show AutoDAN-Turbo achieving state-of-the-art attack success rates and StrongREJECT scores, along with strong transferability across models and datasets and improved test-time efficiency. The work demonstrates both the potential of autonomous red-teaming to expose safety gaps and the need to balance such capabilities with defensive safeguards and resource considerations.

Abstract

In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an attack success rate of 88.5% on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo achieves an even higher attack success rate of 93.4% on GPT-4-1106-turbo.

Paper Structure

This paper contains 51 sections, 1 equation, 7 figures, 10 tables, and 3 algorithms.

Figures (7)

  • Figure 1: Left: our method AutoDAN-Turbo achieves the best attack performance compared with other black-box baselines on HarmBench (Mazeika et al., 2024), surpassing the runner-up by a large margin. Right: our method AutoDAN-Turbo autonomously discovers jailbreak strategies without human intervention and generates jailbreak prompts based on the specific strategies it discovers.
  • Figure 2: The pipeline of AutoDAN-Turbo
  • Figure 3: Our methodology defines a jailbreak strategy as text modifications that increase the jailbreak score, identifying these strategies by comparing differences between consecutive attack logs where a higher score indicates an improved strategy. AutoDAN-Turbo will systematically construct a strategy library, storing data on these strategies and using response embeddings for efficient retrieval, with strategies summarized and formatted for easy access.
  • Figure 4: The transferability of the strategies developed by Gemma-7B-it attacker across different datasets.
  • ...and 2 more figures