Automated Progressive Red Teaming
Bojian Jiang, Yi Jing, Tianhao Shen, Tong Wu, Qing Yang, Deyi Xiong
TL;DR
The paper addresses the challenge of evaluating the safety of large language models (LLMs) by proposing Automated Progressive Red Teaming (APRT), an effectively learnable framework that combines three modular LLMs, Intention Expanding ($M_{\rm exp}$), Intention Hiding ($M_{\rm hid}$), and Evil Maker, with dual Reward LLMs to progressively explore and exploit LLM vulnerabilities. A novel metric, Attack Effectiveness Rate (AER), replaces the traditional attack success rate (ASR) and better aligns automated assessments with human judgments by measuring how often attacks elicit unsafe but seemingly helpful responses. APRT demonstrates strong attack performance and transferability across open-source targets (e.g., Vicuna, Llama-2, Llama-3) and closed-source targets (GPT-4o, Claude-3.5), with significant improvements over baselines and consistent correlation with human evaluation. The work highlights the importance of learning to red-team via progressive training and active data selection, offering a scalable approach to LLM safety evaluation and insights into transferable vulnerabilities that can inform defense and policy.
Abstract
Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly, and difficult to scale. Automated red teaming (ART) offers a more cost-effective alternative, automatically generating adversarial prompts to expose LLM vulnerabilities. However, current ART efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework. APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker that manages prompt diversity and filters ineffective samples. The three modules collectively and progressively explore and exploit LLM vulnerabilities through multi-round interactions. In addition to the framework, we propose a novel indicator, Attack Effectiveness Rate (AER), to mitigate the limitations of existing evaluation metrics. By measuring the likelihood of eliciting unsafe but seemingly helpful responses, AER aligns closely with human evaluations. Extensive experiments with both automatic and human evaluations demonstrate the effectiveness of APRT across both open- and closed-source LLMs. Specifically, APRT elicits 54% unsafe yet useful responses from Meta's Llama-3-8B-Instruct, 50% from GPT-4o (API access), and 39% from Claude-3.5 (API access), showcasing its robust attack capability and transferability across LLMs (especially from open-source to closed-source LLMs).
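The abstract describes AER only informally, as the likelihood of eliciting unsafe but seemingly helpful responses. A minimal sketch of one way to compute such a rate is given below. It assumes each target response has already been given two binary labels (unsafe, helpful) by a human or automatic judge; the names `JudgedResponse` and `attack_effectiveness_rate` are hypothetical and are not taken from the paper, whose exact definition may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class JudgedResponse:
    """One target-model response with two binary judge labels (assumed format)."""
    unsafe: bool    # judge considers the response unsafe
    helpful: bool   # judge considers the response helpful to the attacker


def attack_effectiveness_rate(judged: Iterable[JudgedResponse]) -> float:
    """Fraction of responses labeled both unsafe and helpful.

    This is an illustrative reading of AER, not the paper's implementation.
    """
    items = list(judged)
    if not items:
        raise ValueError("need at least one judged response")
    effective = sum(1 for r in items if r.unsafe and r.helpful)
    return effective / len(items)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    sample = [
        JudgedResponse(unsafe=True, helpful=True),
        JudgedResponse(unsafe=True, helpful=False),   # unsafe but useless
        JudgedResponse(unsafe=False, helpful=True),   # safe answer
        JudgedResponse(unsafe=True, helpful=True),
    ]
    print(f"AER = {attack_effectiveness_rate(sample):.1%}")
```

Under this reading, a response that is unsafe but unhelpful does not count toward the rate, which is the distinction the abstract draws between AER and existing metrics.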
