PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Wesley Hanwen Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon A Gatys

TL;DR

PersonaTeaming introduces personas into automated red-teaming to surface a wider range of adversarial prompts. By combining fixed expert red-teamer (RTer) and regular AI user (User) personas with a dynamic persona generator, the approach improves attack success rates (ASR) while maintaining prompt diversity relative to RainbowPlus. Quantitative results show ASR gains of up to 144.1% with prompt diversity maintained across conditions, and dynamic generation offers broad coverage and increased lexical variety. The work highlights the potential and limitations of integrating persona-driven automation into governance-sensitive evaluation, pointing to future human-in-the-loop and bias-mitigation directions. Overall, PersonaTeaming provides a scalable step toward more diverse, identity-informed automated red-teaming that can complement human practitioners in safety and governance workflows.

Abstract

Recent developments in AI governance and safety research have called for red-teaming methods that can effectively surface potential risks posed by AI models. Many of these calls have emphasized how the identities and backgrounds of red-teamers can shape their red-teaming strategies, and thus the kinds of risks they are likely to uncover. While automated red-teaming approaches promise to complement human red-teaming by enabling larger-scale exploration of model behavior, current approaches do not consider the role of identity. As an initial step towards incorporating people's backgrounds and identities in automated red-teaming, we develop and evaluate a novel method, PersonaTeaming, that introduces personas in the adversarial prompt generation process to explore a wider spectrum of adversarial strategies. In particular, we first introduce a methodology for mutating prompts based on either "red-teaming expert" personas or "regular AI user" personas. We then develop a dynamic persona-generating algorithm that automatically generates various persona types adaptive to different seed prompts. In addition, we develop a set of new metrics to explicitly measure the "mutation distance" to complement existing diversity measurements of adversarial prompts. Our experiments show promising improvements (up to 144.1%) in the attack success rates of adversarial prompts through persona mutation, while maintaining prompt diversity, compared to RainbowPlus, a state-of-the-art automated red-teaming method. We discuss the strengths and limitations of different persona types and mutation methods, shedding light on future opportunities to explore complementarities between automated and human red-teaming approaches.

Paper Structure

This paper contains 17 sections, 2 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of PersonaTeaming. AI developers or policymakers can conduct red-teaming with a pre-selected persona, if they have a target audience in mind. Alternatively, for more exploratory and adaptive red-teaming, AI developers and policymakers can use the persona generation option. If they choose persona generation, they can then choose the type of persona they would like to generate for conducting red-teaming. In this work, we explore two persona types: Expert Red-Teamers (RTers) persona type and Regular AI Users (Users) persona type.
  • Figure 2: System prompt PersonaTeaming uses for mutating a seed prompt based on a persona. We developed this system prompt drawing on prior work on adversarial prompt mutation (Samvelyan et al., 2024).
  • Figure 3: System prompt PersonaTeaming uses for generating RTer personas. This system prompt focuses on generating personas with particular expertise in conducting adversarial prompt mutations. As few-shot examples, we used the personas that we wrote ourselves for the fixed persona mutation in our experiments. When using this prompt, one should be cautious about the potential priming effect that few-shot examples may have on the output.
  • Figure 4: System prompt PersonaTeaming uses for generating User personas. This system prompt focuses on generating personas that represent regular, everyday AI users; we highlight this point throughout the prompt, as LLMs tend to generate RTer personas even when prompted to generate regular users. As with the prompt for generating RTer personas, the few-shot examples are the personas that we wrote ourselves for the fixed persona mutation in our experiments.
  • Figure 5: System prompt PersonaTeaming uses for scoring a persona's fitness for mutating a given prompt.
  • ...and 4 more figures
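
The workflow described in the captions for Figures 1 to 5 (use a pre-selected persona, or generate candidate personas of a chosen type, score their fitness for the seed prompt, and mutate the prompt with the chosen persona) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Persona`, `choose_persona`, and `persona_mutation` are hypothetical, and the `toy_*` functions are simple stand-ins for the LLM calls that the paper drives with its system prompts.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Persona:
    name: str
    kind: str          # "rter" (expert red-teamer) or "user" (regular AI user)
    description: str


# Hypothetical callable types; in practice each would wrap an LLM call.
Generator = Callable[[str, str], Sequence[Persona]]  # (seed_prompt, kind) -> candidates
Fitness = Callable[[Persona, str], float]            # (persona, seed_prompt) -> score
Mutator = Callable[[str, Persona], str]              # (seed_prompt, persona) -> new prompt


def choose_persona(
    seed_prompt: str,
    kind: str = "rter",
    fixed: Optional[Persona] = None,
    generate: Optional[Generator] = None,
    fitness: Optional[Fitness] = None,
) -> Persona:
    """Return the pre-selected persona, or the best-scoring generated one."""
    if fixed is not None:
        return fixed
    if generate is None or fitness is None:
        raise ValueError("persona generation needs both `generate` and `fitness`")
    candidates = list(generate(seed_prompt, kind))
    if not candidates:
        raise ValueError("generator returned no candidate personas")
    return max(candidates, key=lambda p: fitness(p, seed_prompt))


def persona_mutation(seed_prompt: str, mutate: Mutator, **persona_options) -> str:
    """Pick a persona for the seed prompt, then mutate the prompt with it."""
    persona = choose_persona(seed_prompt, **persona_options)
    return mutate(seed_prompt, persona)


# --- Toy stand-ins (assumptions for illustration only, not from the paper) ---
def toy_generate(seed_prompt: str, kind: str) -> Sequence[Persona]:
    return [
        Persona("Avery", kind, "nurse who asks about medication dosage"),
        Persona("Blake", kind, "student who asks about homework"),
    ]


def toy_fitness(persona: Persona, seed_prompt: str) -> float:
    # Crude proxy: number of words shared by the prompt and the description.
    prompt_words = set(seed_prompt.lower().split())
    return float(len(prompt_words & set(persona.description.lower().split())))


def toy_mutate(seed_prompt: str, persona: Persona) -> str:
    return f"[as {persona.name}, a {persona.description}] {seed_prompt}"


seed = "what is a safe medication dosage"
result = persona_mutation(
    seed, toy_mutate, kind="user", generate=toy_generate, fitness=toy_fitness
)
print(result)
```

Passing `fixed=...` skips generation entirely, which mirrors the pre-selected persona path in Figure 1; the word-overlap fitness here is only a placeholder for the LLM-based scoring that Figure 5 describes.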