Automated Progressive Red Teaming

Bojian Jiang, Yi Jing, Tianhao Shen, Tong Wu, Qing Yang, Deyi Xiong

TL;DR

The paper addresses the challenge of evaluating the safety of large language models (LLMs) by proposing Automated Progressive Red Teaming (APRT), a framework that treats red teaming as an effectively learnable task. APRT combines three modules, an Intention Expanding LLM ($M_{\rm exp}$), an Intention Hiding LLM ($M_{\rm hid}$), and an Evil Maker, with two Reward LLMs to progressively explore and exploit the vulnerabilities of a Target LLM. A new metric, Attack Effectiveness Rate (AER), replaces the traditional attack success rate (ASR) so that automated assessments align more closely with human judgments of unsafe-but-helpful responses. APRT demonstrates strong attack performance and transferability across open-source targets (e.g., Vicuna, Llama-2, Llama-3) and closed-source targets (GPT-4o, Claude-3.5), with significant improvements over baselines and consistent correlation with human evaluation. The work highlights the importance of learning to red-team through progressive training and active data selection, offering a scalable approach to LLM safety evaluation and insights into transferable vulnerabilities that can inform defense and policy.

Abstract

Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly, and difficult to scale. Automated red teaming (ART) offers a more cost-effective alternative, automatically generating adversarial prompts to expose LLM vulnerabilities. However, current ART efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework. APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker that manages prompt diversity and filters ineffective samples. The three modules collectively and progressively explore and exploit LLM vulnerabilities through multi-round interactions. In addition to the framework, we propose a novel indicator, Attack Effectiveness Rate (AER), to mitigate the limitations of existing evaluation metrics. By measuring the likelihood of eliciting unsafe but seemingly helpful responses, AER aligns closely with human evaluations. Extensive experiments with both automatic and human evaluations demonstrate the effectiveness of APRT across both open- and closed-source LLMs. Specifically, APRT effectively elicits 54% unsafe yet useful responses from Meta's Llama-3-8B-Instruct, 50% from GPT-4o (API access), and 39% from Claude-3.5 (API access), showcasing its robust attack capability and transferability across LLMs (especially from open-source LLMs to closed-source LLMs).
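The abstract describes AER only informally, as the likelihood of eliciting responses that are unsafe yet seemingly helpful. A minimal sketch of that idea follows, assuming AER is computed as the fraction of target responses that a judge labels both unsafe and helpful. The `JudgedResponse` type and the function name are hypothetical and introduced only for illustration; the paper's exact formula and judging procedure are not given in this section.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class JudgedResponse:
    """One target-model response with two judge labels (hypothetical schema)."""
    unsafe: bool   # judge says the response is unsafe
    helpful: bool  # judge says the response actually helps with the request


def attack_effectiveness_rate(judged: Iterable[JudgedResponse]) -> float:
    """Fraction of responses labeled BOTH unsafe and helpful.

    Assumption: this is one plausible reading of AER from the abstract,
    not the paper's stated definition.
    """
    items = list(judged)
    if not items:
        return 0.0  # convention chosen here for an empty evaluation set
    effective = sum(1 for r in items if r.unsafe and r.helpful)
    return effective / len(items)


if __name__ == "__main__":
    sample = [
        JudgedResponse(unsafe=True, helpful=True),    # counts toward AER
        JudgedResponse(unsafe=True, helpful=False),   # unsafe but useless: not counted
        JudgedResponse(unsafe=False, helpful=True),   # safe answer: not counted
        JudgedResponse(unsafe=False, helpful=False),  # refusal: not counted
    ]
    print(attack_effectiveness_rate(sample))
```

Under this reading, a response that slips past the safety judge but contains nothing usable does not count as an effective attack, which is the distinction the abstract draws against existing metrics.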

Paper Structure (38 sections, 7 figures, 3 tables, 3 algorithms)


Figures (7)

  • Figure 1: Illustration of APRT. During training, the Intention Expanding LLM first generates diverse samples that, once their intention is concealed, can jailbreak the Target LLM relatively easily. For each prompt generated by the Intention Expanding LLM, the Intention Hiding LLM produces multiple rewrites that are deceptive towards the Target LLM without changing the prompt's original intention. The Target LLM aims to generate safe responses that resist the attacks from the Intention Hiding LLM. Two Reward LLMs guide the selection of new incremental training samples for the Intention Hiding LLM. To quickly improve its ability to conceal the intentions within input prompts, the Intention Hiding LLM uses an active learning strategy that prioritizes samples whose intentions are difficult to perceive and that successfully elicit unsafe yet helpful responses from the Target LLM.
  • Figure 2: Trend of the AER (Attack Effectiveness Rate) metric over progressive learning iterations for various open-source Target LLMs.
  • Figure 3: During the training of APRT, we systematically visualize the semantic representations generated by both the Intention Expanding LLM and the Intention Hiding LLM. This visualization facilitates a comparative analysis of the initial states and epoch-4 checkpoints. In our visualizations, blue dots denote samples where attacks are not successful, whereas red dots indicate samples where attacks are successful.
  • Figure 4: Comparison of AER scores across multiple iterative rounds, with and without the Intention Expanding LLM.
  • Figure 5: Comparison of data selection algorithms between MART (Ge et al., 2023) and APRT (Ours, active learning-based method).
  • ...and 2 more figures
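
The selection step described in the Figure 1 caption can be sketched as a simple filter over judged candidates. This is a rough illustration under stated assumptions, not the paper's algorithm: the `Candidate` fields, the two score scales (higher means more unsafe or more helpful), the thresholds, and the ranking by summed score are all hypothetical, since this section does not specify how the two Reward LLMs are combined.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    """A rewritten prompt plus reward scores for the target's response.

    All fields are hypothetical; the caption names the components but
    not their data format.
    """
    intent_id: str        # identifies the original (unchanged) intention
    rewritten_prompt: str # output of the Intention Hiding LLM
    unsafe_score: float   # assumed: higher = response judged more unsafe
    helpful_score: float  # assumed: higher = response judged more helpful


def select_incremental_samples(
    candidates: Sequence[Candidate],
    unsafe_min: float = 0.5,
    helpful_min: float = 0.5,
    top_k: int = 2,
) -> List[Candidate]:
    """Keep candidates that elicited unsafe AND helpful responses,
    then return the top_k by combined score (an assumed ranking rule)."""
    effective = [
        c for c in candidates
        if c.unsafe_score >= unsafe_min and c.helpful_score >= helpful_min
    ]
    effective.sort(key=lambda c: c.unsafe_score + c.helpful_score, reverse=True)
    return effective[:top_k]


if __name__ == "__main__":
    pool = [
        Candidate("i1", "<rewrite a>", unsafe_score=0.9, helpful_score=0.8),
        Candidate("i1", "<rewrite b>", unsafe_score=0.9, helpful_score=0.2),
        Candidate("i2", "<rewrite c>", unsafe_score=0.6, helpful_score=0.7),
        Candidate("i2", "<rewrite d>", unsafe_score=0.7, helpful_score=0.9),
    ]
    for c in select_incremental_samples(pool):
        print(c.intent_id, c.rewritten_prompt)
```

The selected samples would then serve as the new incremental training data for the Intention Hiding LLM in the next round; how that fine-tuning is performed is outside the scope of this sketch.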