STAR: SocioTechnical Approach to Red Teaming Language Models

Laura Weidinger; John Mellor; Bernat Guillen Pegueroles; Nahema Marchal; Ravin Kumar; Kristian Lum; Canfer Akbulut; Mark Diaz; Stevie Bergman; Mikel Rodriguez; Verena Rieser; William Isaac

TL;DR

STAR enhances steerability by generating parameterised instructions for human red teamers, which improves coverage of the risk surface. It also improves signal quality by matching annotators' demographics to the specific groups whose harms are being assessed, which results in more sensitive annotations.

Abstract

This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming the safety of large language models. STAR makes two key contributions. First, it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface. Parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching annotators' demographics to the specific groups whose harms are being assessed, resulting in more sensitive annotations. STAR further employs a novel arbitration step to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
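To make the idea of parameterised instructions concrete, the following is a minimal sketch of how instructions could be generated procedurally from a small set of parameters. It is not the authors' implementation: the function name `generate_instructions`, the template wording, and the placeholder group names are assumptions made for illustration. Only the two rule names are taken from the figure captions listed further down.

```python
from itertools import product


def generate_instructions(rules, groups, template):
    """Return one instruction per (rule, group) combination.

    Enumerating the full cross product gives every combination of
    parameters exactly one instruction, so coverage is even by construction.
    """
    return [
        {"rule": rule, "group": group, "text": template.format(rule=rule, group=group)}
        for rule, group in product(rules, groups)
    ]


# Rule names appear in the figure captions; everything else is hypothetical.
RULES = ["hate speech", "discriminatory stereotypes"]
GROUPS = ["group A", "group B", "group C"]  # hypothetical placeholder groups
TEMPLATE = "Try to elicit {rule} from the model targeting {group}."  # hypothetical wording

instructions = generate_instructions(RULES, GROUPS, TEMPLATE)

for instruction in instructions:
    print(instruction["text"])
```

Because each instruction records the parameters that produced it, any model failure found by a red teamer can be traced back to a specific rule and group, which is one plausible reading of how parameterisation yields more detailed insights without extra cost.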

Paper Structure (45 sections, 10 figures, 9 tables)

Introduction
Background
Steerability
Signal Quality
STAR: SocioTechnical Approach to Red Teaming
Improving Steerability
Improving Signal Quality
Expert- and demographic matching
Learning from annotator disagreement
Methods
Data
Task design
Red teaming task
Annotation task
Arbitration task
...and 30 more sections

Figures (10)

Figure 1: STAR procedurally generates parametric instructions to ensure comprehensive AI red teaming.
Figure 2: UMAP of the embedding space of dialogues across three red teaming datasets (Anthropic, DEFCON, and STAR), as well as dialogues between a proprietary model and users that were flagged as undesirable by us. Visual inspection shows similar coverage and clustering for the STAR approach compared to the other approaches. Cluster analysis further reveals that STAR results in more intentional thematic clustering based on the red teaming instructions, compared to the other projected red teaming approaches. Each dot indicates a dialogue. For comparability, we downsampled all datasets to include a maximum of 4000 randomly selected instances.
Figure 3: Specific instructions and a diverse annotator pool result in even exploration of attacks against different demographic groups, while maintaining 'demographic matching'.
Figure 4: In- and out-group annotations of dialogues targeting hate speech or discriminatory stereotypes against demographic groups. In-group annotations are slightly less likely to mark rules as 'definitely not broken', and slightly more likely to mark them 'definitely broken'. Error bars indicate 95% CI.
Figure 5: In- and out-group annotations by rule. Hate speech shows a significant difference between in- and out-group annotators in terms of their likelihood of rating a rule as broken.
...and 5 more figures
