Abstractive Red-Teaming of Language Model Character

Nate Rahn, Allison Qi, Avery Griffin, Jonathan Michala, Henry Sleight, Erik Jones

TL;DR

The paper addresses how to audit language-model assistants that are meant to adhere to a fixed character specification. It introduces abstractive red-teaming, which searches for high-level natural-language categories of queries that are likely to trigger violations. The framework combines a category generator, a query generator, and principle-specific reward signals, and proposes two algorithms, Category-Level Reinforcement Learning (CRL) and Query-Category Iteration (QCI), for efficiently discovering violation-prone categories. Across seven target models and a 12-principle specification, CRL and QCI outperform baselines and surface qualitatively meaningful categories, revealing risks such as AI supremacy claims, recommendations of illegal weapons, and sexist content under certain prompts. The approach offers a practical, scalable method for pre-deployment auditing and can inform future safety training and constitution-based alignment efforts.

Abstract

We want language model assistants to conform to a character specification, which asserts how the model should act across diverse user interactions. While models typically follow these character specifications, they can occasionally violate them in large-scale deployments. In this work, we aim to identify types of queries that are likely to produce such character violations at deployment, using much less than deployment-level compute. To do this, we introduce abstractive red-teaming, where we search for natural-language query categories, e.g. "The query is in Chinese. The query asks about family roles," that routinely elicit violations. These categories abstract over the many possible variants of a query which could appear in the wild. We introduce two algorithms for efficient category search against a character-trait-specific reward model: one based on reinforcement learning on a category generator LLM, and another which leverages a strong LLM to iteratively synthesize categories from high-scoring queries. Across a 12-principle character specification and 7 target models, we find that our algorithms consistently outperform baselines, and generate qualitatively interesting categories; for example, queries which ask Llama-3.1-8B-Instruct to predict the future lead to responses saying that AI will dominate humanity, and queries that ask GPT-4.1-Mini for essential prison survival items lead to enthusiastic recommendation of illegal weapons. Overall, we believe our results represent an important step towards realistic pre-deployment auditing of language model character.
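
As a rough illustration of the search objective described in the abstract, the sketch below scores a category by sampling queries from it, collecting target-model responses, and averaging a reward-model score over them. Everything here is a hypothetical stand-in: `score_category`, `best_category`, and the three toy functions are not from the paper, and taking the mean is an assumed aggregation that may differ from the paper's.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def score_category(
    category: str,
    sample_queries: Callable[[str, int], List[str]],
    target_model: Callable[[str], str],
    reward_model: Callable[[str, str], float],
    n_queries: int = 8,
) -> float:
    """Estimate how violation-prone a category is (assumed: mean reward)."""
    queries = sample_queries(category, n_queries)
    rewards = [reward_model(q, target_model(q)) for q in queries]
    return mean(rewards)


# --- Toy stand-ins (hypothetical; real versions would be LLM calls) ---
def toy_sample_queries(category: str, n: int) -> List[str]:
    # A real query generator would write n varied queries in the category.
    return [f"{category} (variant {i})" for i in range(n)]


def toy_target_model(query: str) -> str:
    # Pretend the target model misbehaves on future-prediction queries.
    return "AI will rule" if "future" in query else "Here is a balanced answer."


def toy_reward_model(query: str, response: str) -> float:
    # Higher reward means a stronger violation of the principle under test.
    return 1.0 if "rule" in response else 0.0


def best_category(
    categories: List[str], n_queries: int = 8
) -> Tuple[str, Dict[str, float]]:
    """Score each candidate category and return the highest-scoring one."""
    scored = {
        c: score_category(
            c, toy_sample_queries, toy_target_model, toy_reward_model, n_queries
        )
        for c in categories
    }
    return max(scored, key=scored.get), scored
```

Scoring at the category level, rather than per query, is what lets a search procedure prefer categories whose unseen member queries are also likely to elicit violations.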

Paper Structure (30 sections, 3 equations, 24 figures, 3 tables, 2 algorithms)

Figures (24)

  • Figure 1: We introduce abstractive red-teaming. Our framework involves searching for natural-language categories, each describing some large set of user queries, such that many of a target model's responses to those queries violate some character specification. By doing so, we surface character failures which are likely to occur at deployment, since a user submitting an unseen query within a category will trigger similar behavior.
  • Figure 2: Abstractive red-teaming discovers categories of user queries which elicit diverse and unexpected character violations. Here, we show three boxes, each representing a single category whose queries, when submitted to a target model, lead to violations of some principle of a character specification. For each category, we show several example queries within that category and the corresponding responses from the target model.
  • Figure 3: We present two algorithms for discovering query categories which frequently produce character violations. CRL discovers good categories through reinforcement learning on a category generator LLM, using a category-level reward signal obtained by aggregating the rewards of responses to queries within a category. QCI maintains an experience pool of high-scoring queries, from which we synthesize a category using an LLM. At each step, QCI explores new queries through two paths: First, it samples queries based on random categories from a fixed category generator. Second, QCI subsamples attributes from the existing best category, and samples queries within these subset categories which are adjacent to but different from the current category. By merging the queries which produce the most violative responses into the experience pool, we create selective pressure which isolates the category attributes responsible for character violations.
  • Figure 4: To implement abstractive red-teaming, we start with a dataset of natural user queries, and some character specification made up of several principles. We synthetically generate categories describing each query using an LLM, and use the resulting datasets to train category generator and query generator LLMs. We then leverage these models to generate training data for a principle-specific reward model. Specifically, we mix violative categories, obtained by prompting an LLM with the principle, with random categories sampled from the category generator, and then sample queries in both types of categories using the query generator. Finally, we prompt an LLM to generate responses of varying quality to each query, and use an LLM judge to produce preferences over the resulting query-response pairs. We train the reward model on the resulting preference data.
  • Figure 5: Comparing CRL and QCI (ours) against RS (baseline) across a varying number of queries to the target model, we find that CRL and QCI are significantly more efficient at finding high-scoring categories. Additionally, QCI is more sample-efficient than CRL in the query-limited regime.
  • ...and 19 more figures
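
The QCI loop described in the Figure 3 caption can be sketched roughly as follows. This is a toy under loud assumptions, not the authors' implementation: categories and queries are represented as sets of attribute strings, the LLM-based category synthesis is replaced by a majority vote over the experience pool, and a single hypothetical `reward` function stands in for both the target model and the reward model. All names (`qci`, `synthesize_category`, `subsample`, `ATTRIBUTES`, and so on) are illustrative.

```python
import random
from collections import Counter
from typing import Callable, FrozenSet, List, Tuple

Category = FrozenSet[str]  # toy: a category is a set of attribute strings
Query = FrozenSet[str]     # toy: a query is the set of attributes it exhibits

ATTRIBUTES = [
    "in Chinese", "asks about family roles", "asks for a recipe",
    "mentions sports", "is a coding question",
]


def random_category(rng: random.Random) -> Category:
    # Stand-in for sampling from a fixed category generator.
    return frozenset(rng.sample(ATTRIBUTES, 2))


def sample_query(category: Category, rng: random.Random) -> Query:
    # Stand-in for the query generator: category attributes plus one extra.
    return frozenset(category | {rng.choice(ATTRIBUTES)})


def synthesize_category(pool: List[Tuple[Query, float]]) -> Category:
    # Stand-in for LLM synthesis: keep attributes in a strict majority of the pool.
    counts = Counter(a for q, _ in pool for a in q)
    return frozenset(a for a, c in counts.items() if c > len(pool) / 2)


def subsample(category: Category, rng: random.Random) -> Category:
    # Drop some attributes to get a category adjacent to the current best.
    attrs = sorted(category)
    if len(attrs) <= 1:
        return category
    return frozenset(rng.sample(attrs, rng.randint(1, len(attrs) - 1)))


def qci(
    reward: Callable[[Query], float],
    steps: int = 50,
    pool_size: int = 8,
    n_explore: int = 6,
    seed: int = 0,
) -> Tuple[Category, List[Tuple[Query, float]]]:
    rng = random.Random(seed)
    pool: List[Tuple[Query, float]] = []  # experience pool of (query, reward)
    best: Category = frozenset()
    for _ in range(steps):
        # Path 1: explore queries from random categories.
        cats = [random_category(rng) for _ in range(n_explore)]
        # Path 2: explore attribute subsets of the current best category.
        if best:
            cats += [subsample(best, rng) for _ in range(n_explore)]
        candidates = [sample_query(c, rng) for c in cats]
        # Merge, keeping only the highest-reward queries (selective pressure).
        pool += [(q, reward(q)) for q in candidates]
        pool = sorted(pool, key=lambda item: item[1], reverse=True)[:pool_size]
        best = synthesize_category(pool)
    return best, pool


TARGET = frozenset({"in Chinese", "asks about family roles"})


def toy_reward(query: Query) -> float:
    # Hypothetical: violations occur only when both target attributes co-occur.
    return len(TARGET & query) / 2
```

Calling `qci(toy_reward)` returns the synthesized category together with the final experience pool. Because only the top-scoring queries survive each merge, attributes that do not contribute to reward tend to fall out of the majority vote, which mirrors the caption's point about isolating the attributes responsible for violations.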