AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao
TL;DR
The paper addresses the vulnerability of aligned LLMs to jailbreak prompts by introducing AutoDAN-Turbo, a black-box, lifelong-learning framework that autonomously discovers, evolves, and retrieves jailbreak strategies. It integrates three modules (attack generation with scoring, a strategy library built from attack logs, and a strategy retrieval mechanism) that together enable continual improvement and reuse of strategies, while remaining compatible with human-designed strategies. Extensive HarmBench and cross-model evaluations show AutoDAN-Turbo achieving state-of-the-art attack success rates and StrongREJECT scores, along with strong transferability across models and datasets and improved test-time efficiency. The work demonstrates both the potential of autonomous red-teaming to expose safety gaps and the need to balance such capabilities against defensive safeguards and resource costs.
Abstract
In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo significantly outperforms baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an attack success rate of 88.5% on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo achieves an even higher attack success rate of 93.4% on GPT-4-1106-turbo.
