AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming

Muxi Diao, Yutao Mou, Keqing He, Hanbo Song, Lulu Zhao, Shikun Zhang, Wei Ye, Kongming Liang, Zhanyu Ma

TL;DR

AutoRed introduces a seed-free, two-stage adversarial prompt framework for automated red teaming of LLMs. By leveraging persona priors for free-form instruction generation and a reflection/refinement loop guided by an instruction verifier, AutoRed produces semantically diverse, high-impact prompts without seed templates. Empirical results across eight LLMs show superior attack success rates and strong cross-model generalization compared with seed-based baselines, revealing persistent safety vulnerabilities in frontier models. The work also demonstrates that AutoRed-derived prompts can aid safety improvements via supervised fine-tuning, and it provides two open datasets (AutoRed-Hard and AutoRed-Medium) to advance robustness evaluation research.

Abstract

The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
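The control flow described in the abstract (persona-guided generation, verifier scoring, then a bounded refinement loop) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Candidate`, `reflection_loop`, `threshold`, and `max_rounds` are hypothetical, the three callables stand in for the attack model and the trained verifier, and the toy stand-ins are placeholders that only exercise the loop. The sketch also omits the Stage 1 step of training the verifier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    persona: str        # persona description used to condition generation
    prompt: str         # current text of the candidate instruction
    score: float = 0.0  # latest verifier score
    rounds: int = 0     # number of refinement rounds applied so far


def reflection_loop(
    personas: List[str],
    generate: Callable[[str], str],       # hypothetical: persona -> initial prompt
    verify: Callable[[str], float],       # hypothetical: prompt -> score in [0, 1]
    refine: Callable[[str, float], str],  # hypothetical: (prompt, score) -> revised prompt
    threshold: float = 0.5,               # assumed acceptance cutoff
    max_rounds: int = 3,                  # assumed cap on refinement rounds
) -> List[Candidate]:
    """Keep candidates whose verifier score reaches the threshold.

    Scoring uses only the verifier; no target model is queried in the loop.
    """
    accepted = []
    for persona in personas:
        cand = Candidate(persona, generate(persona))
        cand.score = verify(cand.prompt)
        # Refine low-scoring candidates until they pass or the budget runs out.
        while cand.score < threshold and cand.rounds < max_rounds:
            cand.prompt = refine(cand.prompt, cand.score)
            cand.score = verify(cand.prompt)
            cand.rounds += 1
        if cand.score >= threshold:
            accepted.append(cand)
    return accepted


# Toy stand-ins (placeholders only, unrelated to any real model).
def toy_generate(persona: str) -> str:
    return persona


def toy_verify(prompt: str) -> float:
    return min(prompt.count("!") / 2, 1.0)


def toy_refine(prompt: str, score: float) -> str:
    return prompt + "!"


if __name__ == "__main__":
    kept = reflection_loop(["a!!", "b"], toy_generate, toy_verify, toy_refine,
                           threshold=1.0, max_rounds=3)
    for c in kept:
        print(c.persona, c.rounds, c.score)
```

Scoring with a verifier rather than the target model is the efficiency point the abstract makes: each refinement round costs one verifier call instead of a round trip to every evaluated LLM.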

Paper Structure

This paper contains 53 sections, 6 equations, 12 figures, 13 tables.

Figures (12)

  • Figure 1: t-SNE visualization of prompt representations from seed prompts, seed-based red teaming methods, and AutoRed.
  • Figure 2: Quantitative and qualitative comparison between AutoRed and other red teaming methods.
  • Figure 3: The AutoRed workflow has two main stages. In Stage 1 (Adversarial Attacks on Target Models), an attack model generates small batches of persona-guided adversarial instructions, which help train an instruction verifier. In Stage 2 (Reflection and Refinement), a larger set of adversarial instructions is filtered by the verifier and then iteratively refined in a reflection loop.
  • Figure 4: AutoRed Industry Background Distribution.
  • Figure 5: AutoRed Skill Level Distribution.
  • ...and 7 more figures