Abstractive Red-Teaming of Language Model Character
Nate Rahn, Allison Qi, Avery Griffin, Jonathan Michala, Henry Sleight, Erik Jones
TL;DR
The paper tackles the problem of deploying language-model assistants that adhere to a fixed character specification. It introduces abstractive red-teaming, which searches for high-level natural-language categories of queries that are likely to trigger violations. The proposed category-centric framework combines a category generator, a query generator, and principle-specific reward signals. Two algorithms, Category-Level Reinforcement Learning (CRL) and Query-Category Iteration (QCI), search this framework efficiently for violation-eliciting categories. Across seven target models and a 12-principle specification, CRL and QCI outperform baselines and surface qualitatively meaningful categories, revealing risks such as AI supremacy claims, illicit tool recommendations, and sexist content under certain prompts. The approach offers a practical, scalable method for pre-deployment auditing and can inform future safety training and constitution-based alignment efforts.
Abstract
We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. "The query is in Chinese. The query asks about family roles," that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character.
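To make the category-search idea concrete, the following is a minimal toy sketch of the second algorithm described above: score each category by the mean reward of queries sampled from it, then synthesize new categories from the highest-scoring queries. It is not the authors' implementation. Every function name, parameter, and stub here is a hypothetical stand-in: in a real setup the query generator, target model, and synthesizer would be LLM calls and the reward would come from a trait-specific reward model.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases; each callable stands in for an LLM or reward model.
QueryGen = Callable[[str, int], List[str]]      # (category, n) -> queries
Target = Callable[[str], str]                   # query -> response
Reward = Callable[[str, str], float]            # (query, response) -> violation score
Synthesizer = Callable[[List[str]], List[str]]  # top queries -> new categories


def score_category(category: str, query_gen: QueryGen, target: Target,
                   reward: Reward, n_queries: int) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the mean reward of a category plus its per-query rewards."""
    queries = query_gen(category, n_queries)
    scored = [(q, reward(q, target(q))) for q in queries]
    mean = sum(r for _, r in scored) / len(scored)
    return mean, scored


def iterate_categories(seeds: List[str], synthesize: Synthesizer, query_gen: QueryGen,
                       target: Target, reward: Reward, rounds: int = 3,
                       n_queries: int = 4, top_k: int = 4) -> Tuple[float, str]:
    """Alternate between scoring categories and synthesizing new ones."""
    best_score, best_category = float("-inf"), ""
    categories = list(seeds)
    for _ in range(rounds):
        pool: List[Tuple[str, float]] = []
        for category in categories:
            mean, scored = score_category(category, query_gen, target, reward, n_queries)
            if mean > best_score:
                best_score, best_category = mean, category
            pool.extend(scored)
        # Keep the highest-reward queries and abstract them into new categories.
        pool.sort(key=lambda pair: pair[1], reverse=True)
        categories = synthesize([q for q, _ in pool[:top_k]])
    return best_score, best_category


# Deterministic toy stubs (assumptions, for illustration only).
def toy_query_gen(category: str, n: int) -> List[str]:
    return [f"{category} #{i}" for i in range(n)]


def toy_target(query: str) -> str:
    return query.upper()


def toy_reward(query: str, response: str) -> float:
    return 1.0 if "FUTURE" in response else 0.0


def toy_synthesize(top_queries: List[str]) -> List[str]:
    categories: List[str] = []
    for q in top_queries:
        category = q.rsplit(" #", 1)[0]  # recover the category text from the toy query
        if category not in categories:
            categories.append(category)
    return categories


best_score, best_category = iterate_categories(
    ["asks about cooking", "asks to predict the future"],
    toy_synthesize, toy_query_gen, toy_target, toy_reward,
)
```

The sketch scores a category by averaging over sampled queries because a category is meant to abstract over many query variants, so a single high-scoring query is not enough to flag it. The reinforcement-learning variant would instead update the category generator's weights from these category-level scores, which this toy loop does not attempt.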
