OpenAI's Approach to External Red Teaming for AI Models and Systems

Lama Ahmad, Sandhini Agarwal, Michael Lampe, Pamela Mishkin

TL;DR

The paper addresses how OpenAI designs and conducts external red teaming to assess and mitigate risks in frontier AI models and systems. It details design decisions for cohort composition, access controls, interfaces, and documentation, and explains how domain prioritization informs testing scope. It demonstrates how human red teaming data seeds automated evaluations and benchmarks, with concrete examples from GPT-4, GPT-4o, and DALL-E 3 illustrating risk discovery and mitigations. The authors discuss limitations, governance considerations, and the role of external red teaming within a broader risk-management ecosystem to improve safety evaluations and public trust.

Abstract

Red teaming has emerged as a critical practice in assessing the possible risks of AI models and systems. It aids in the discovery of novel risks, stress testing possible gaps in existing mitigations, enriching existing quantitative safety metrics, facilitating the creation of new safety measurements, and enhancing public trust and the legitimacy of AI risk assessments. This white paper describes OpenAI's work to date in external red teaming and draws some more general conclusions from this work. We describe the design considerations underpinning external red teaming, which include selecting the composition of the red team, deciding on access levels, and providing the guidance required to conduct red teaming. Additionally, we show outcomes that red teaming can enable, such as input into risk assessment and automated evaluations. We also describe the limitations of external red teaming, and how it can fit into a broader range of AI model and system evaluations. Through these contributions, we hope that AI developers and deployers, evaluation creators, and policymakers will be able to better design red teaming campaigns and get a deeper look into how external red teaming can fit into model deployment and evaluation processes. These methods are evolving, and the value of different methods continues to shift as the ecosystem around red teaming matures and models themselves improve as tools for red teaming.

Paper Structure

This paper contains 5 sections, 3 figures.

Figures (3)

  • Figure 1: Example areas of testing and motivating questions
  • Figure 2: Pros and cons of different types of model access for red teamers
  • Figure 3: Interface that enables rapid comparison across prompts along with pre-specified questions to enrich findings