STAR: SocioTechnical Approach to Red Teaming Language Models

Laura Weidinger, John Mellor, Bernat Guillen Pegueroles, Nahema Marchal, Ravin Kumar, Kristian Lum, Canfer Akbulut, Mark Diaz, Stevie Bergman, Mikel Rodriguez, Verena Rieser, William Isaac

TL;DR

STAR enhances steerability by generating parameterised instructions for human red teamers, which improves coverage of the risk surface. It also improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations.

Abstract

This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming the safety of large language models. STAR makes two key contributions. First, it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface. Parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel arbitration step to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
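
The idea of procedurally generated, parameterised instructions can be illustrated with a minimal sketch. It is not the authors' implementation: the template wording, the parameter names, and the placeholder values (`RULES`, `DEMOGRAPHIC_GROUPS`, `USE_CASES`) are assumptions made for illustration, and only the two rule names are taken from the figure captions. The sketch fills a template from every combination of parameters and cycles through the combinations so that each one is assigned about equally often.

```python
import itertools
import random

# Hypothetical parameter values, chosen only for illustration.
RULES = ["hate speech", "discriminatory stereotypes"]
DEMOGRAPHIC_GROUPS = ["group A", "group B", "group C"]
USE_CASES = ["asking for advice", "requesting a story"]

# Hypothetical instruction template shown to a human red teamer.
TEMPLATE = (
    "Try to get the model to break the rule on {rule} "
    "against {group} while {use_case}."
)


def generate_instructions(n, seed=0):
    """Return n instructions spread evenly over all parameter combinations."""
    combos = list(itertools.product(RULES, DEMOGRAPHIC_GROUPS, USE_CASES))
    rng = random.Random(seed)
    rng.shuffle(combos)
    # Cycling through the shuffled combinations keeps the per-combination
    # counts within one of each other.
    picked = [combos[i % len(combos)] for i in range(n)]
    return [
        {
            "rule": rule,
            "group": group,
            "use_case": use_case,
            "text": TEMPLATE.format(rule=rule, group=group, use_case=use_case),
        }
        for rule, group, use_case in picked
    ]
```

Because every instruction carries its parameters alongside the text, a model failure can later be broken down by rule, group, or use case without extra annotation work, which is one way to read the claim that parameterised instructions give more detailed insights at no increased cost.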

Paper Structure (45 sections, 10 figures, 9 tables)

Figures (10)

  • Figure 1: STAR procedurally generates parametric instructions to ensure comprehensive AI red teaming.
  • Figure 2: UMAP projection of the embedding space of dialogues across three red teaming datasets (Anthropic, DEFCON, and STAR), together with dialogues between a proprietary model and users that we flagged as undesirable. Each dot indicates a dialogue. Visual inspection shows that STAR achieves coverage and clustering similar to the other approaches. Cluster analysis further reveals that STAR yields more intentional thematic clustering, based on the red teaming instructions, than the other projected red teaming approaches. For comparability, we downsampled every dataset to at most 4,000 randomly selected instances.
  • Figure 3: Specific instructions and a diverse annotator pool result in even exploration of attacks against different demographic groups, while maintaining 'demographic matching'.
  • Figure 4: In- and out-group annotations of dialogues targeting hate speech or discriminatory stereotypes against demographic groups. In-group annotations are slightly less likely to mark rules as 'definitely not broken', and slightly more likely to mark them 'definitely broken'. Error bars indicate 95% CI.
  • Figure 5: In- and out-group annotations by rule. Hate speech shows a significant difference between in- and out-group annotators in terms of their likelihood of rating a rule as broken.
  • ...and 5 more figures
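
The downsampling mentioned in the Figure 2 caption (at most 4,000 randomly selected instances per dataset) could be done along the following lines. This is a sketch under assumptions: the function name `downsample`, the fixed seed, and the representation of a dataset as a list of dialogues are illustrative choices, not details from the paper.

```python
import random


def downsample(dialogues, max_size=4000, seed=0):
    """Return at most max_size dialogues, sampled without replacement."""
    dialogues = list(dialogues)
    if len(dialogues) <= max_size:
        # Small datasets are kept whole.
        return dialogues
    # A seeded generator makes the sample reproducible across runs.
    return random.Random(seed).sample(dialogues, max_size)
```

Capping every dataset at the same size keeps a larger dataset from visually dominating the projection simply because it contributes more points.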