Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games

Chengdong Ma; Ziran Yang; Hai Ci; Jun Gao; Minquan Gao; Xuehai Pan; Yaodong Yang

TL;DR

This work tackles safety vulnerabilities in large language models by reframing red-teaming as a dynamic, multi-round, multi-agent game, the Red Team Game (RTG). It introduces the Gamified Red Team Solver (GRTS), a population-based, PSRO-style (Policy-Space Response Oracles) method that expands the red and blue policy sets and optimizes meta-strategies to converge toward an approximate Nash equilibrium while maintaining semantic diversity to mitigate mode collapse. The authors prove convergence properties via generalized fictitious play and validate the approach empirically, showing improved attack diversity, stronger defenses, and safer alignment across multiple open-source models. The study reveals a spinning-top geometric structure in RTG, underscoring the need for diverse attacker populations to robustly assess and improve LLM safety and toxicity detection, with practical implications for scalable alignment and threat discovery.

Abstract

The primary challenge in deploying Large Language Models (LLMs) is ensuring their harmlessness. Red teaming can identify vulnerabilities by attacking LLMs, which helps attain safety. However, current efforts rely heavily on single-round prompt designs and unilateral red team optimization against fixed blue teams. These static approaches significantly reduce generation diversity, a phenomenon known as mode collapse, which makes it difficult to discover the potential risks in increasingly complex human-LLM interactions. Here we introduce the dynamic Red Team Game (RTG) to comprehensively analyze the multi-round offensive and defensive interactions between the red team and the blue team. Furthermore, we develop a Gamified Red Team Solver (GRTS) with diversity measures to mitigate mode collapse and theoretically guarantee convergence to an approximate Nash equilibrium, which results in better strategies for both teams. Empirical results demonstrate that GRTS explores diverse and implicit attacks to adaptively exploit various LLMs, surpassing the constraints of specific modes. Insightfully, the geometrical structure of the red team task that we unveil aligns with the spinning top hypothesis, confirming the necessity of constructing a diverse LLM population as a promising proxy for heterogeneous human expert red-teamers. This paves the way for scalable toxicity detection and safe alignment of LLMs.

Paper Structure (44 sections, 2 theorems, 24 equations, 15 figures, 14 tables, 3 algorithms)

This paper contains 44 sections, 2 theorems, 24 equations, 15 figures, 14 tables, 3 algorithms.

Introduction
Related Work
Problem Formulation
Markov Decision Process for Token Generation
Extensive-form Game in Dialogue
Gamified Red Teaming Solver
Solving Meta Game of Red-teaming LLMs
Diversity Measure of Semantic Space
Experiments and Results
Solving RTG
Training setup
Evaluation Metrics
The overall game-solving process
The geometrical structure of RTG
The Best Response Iteration for Red Team
...and 29 more sections

Key Result

Proposition 1

(Approximate Nash Convergence of GRTS). If $f$ is concave, and GRTS uses the update rule: Here, $\alpha_{t}=o(1 / \log t)$ is a deterministic parameter, and $\boldsymbol{Y}_{t+1}^{i}$ represents the differences between the expected and actual changes in policies. Consequently, GRTS exhibits an analogous convergence property to that of Generalized Weakened Fictitious Play (GWFP): the poli
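
The fictitious-play family of updates that the proposition refers to can be illustrated with a small sketch. It is not the GRTS update itself: it assumes a two-player zero-sum matrix game in place of LLM policies, exact best responses, a step size $\alpha_t = 1/(t+1)$ (which satisfies $\alpha_t = o(1/\log t)$), and a perturbation term fixed at zero. The function names `exploitability` and `fictitious_play` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def exploitability(payoff, row, col):
    """Total gain available to both players from deviating to a best response.

    payoff[i, j] is the row player's payoff; the column player receives its negative.
    The value is non-negative and equals zero exactly at a Nash equilibrium.
    """
    row_best = np.max(payoff @ col)   # best row payoff against the column mix
    col_best = np.min(row @ payoff)   # lowest row payoff the column player can force
    return row_best - col_best

def fictitious_play(payoff, steps=5000):
    """Averaged best-response updates with step size alpha_t = 1 / (t + 1)."""
    n_rows, n_cols = payoff.shape
    row = np.full(n_rows, 1.0 / n_rows)   # start from uniform mixed strategies
    col = np.full(n_cols, 1.0 / n_cols)
    for t in range(1, steps + 1):
        alpha = 1.0 / (t + 1)
        # Exact best responses as one-hot vectors (perturbation assumed to be zero).
        br_row = np.eye(n_rows)[np.argmax(payoff @ col)]
        br_col = np.eye(n_cols)[np.argmin(row @ payoff)]
        # Move each strategy a small step toward its best response.
        row = (1 - alpha) * row + alpha * br_row
        col = (1 - alpha) * col + alpha * br_col
    return row, col

if __name__ == "__main__":
    # Rock-paper-scissors as a toy zero-sum game.
    rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
    row, col = fictitious_play(rps)
    print("row strategy:", row)
    print("exploitability:", exploitability(rps, row, col))
```

On rock-paper-scissors the averaged strategies drift toward the uniform mix and the exploitability shrinks as the number of steps grows, which is the kind of behavior the convergence statement describes for the policy sequence.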

Figures (15)

Figure 1: The process of the Red Teaming Game in multi-round dialogue. The red team continuously outputs toxic prompts during the dialogue, attempting to guide the blue team to output toxic content. The left side outlines the existing technical routes for red teaming LLMs, primarily divided into human red teams and automated red teams. The human red teams consist of diverse human experts utilizing heuristic prompts, thus incurring high costs and inefficiencies. The automated red team approaches are more efficient and scalable; however, current methods focus on single-round attacks and static blue team targets, leading to mode collapse. Our approach also falls within the realm of automated red team methods but is grounded in dynamic game-theoretic principles, enabling diversified multi-round attacks. This improves the depth of interaction between the red team and the blue team, enabling successful attacks in subsequent rounds following an initial failure (on the right side).
Figure 2: A bi-level optimization framework including MDPTG and ETGD. The generation process of each token in a sentence represents each decision step in MDPTG, and the completion of a sentence generation represents the end of one MDPTG iteration, at which point either the red team LLM or the blue team LLM generates a sentence. Furthermore, within the multi-round sentence-level interaction between the red team and the blue team, each sentence represents a decision step in ETGD, where the red team and the blue team alternate decisions, constituting an extensive-form game. Our method optimizes these two processes separately at the token and sentence levels, known as RTG, ultimately refining the output policies of red team and blue team LLMs towards approximate Nash equilibrium.
Figure 3: The process of the Red Teaming Game in multi-round dialogue. The red team continuously outputs toxic prompts during the dialogue, attempting to guide the blue team to output toxic content. 1. Initialize the red team and blue team strategy populations with one policy each, and initialize a set of LLMs for the two teams. 2. Select a policy (LLM) from each of the red team and blue team populations to interact with the opponent's population through multi-round RTG interaction, and use the resulting dialogue history for training (obtaining the best response policy). 3. Add the latest best response policy to the population, and construct a toxicity matrix based on the meta game between populations (Meta RTG). 4. Use Nash solvers or other solvers to solve the meta RTG and obtain the restricted approximate Nash equilibrium strategy distribution of the sub-games (the new strategy distribution). 5. Use the new strategy distribution as the initial strategy for the next round of GRTS iteration.
Figure 4: (a) shows the variation in exploitability during the iterative solving process of GRTS, reflecting changes in proximity to approximate Nash equilibrium. (b) and (c) respectively demonstrate the standard deviation and variance changes of the payoff during the training process, confirming that the geometric structure of the RTG is a spinning top.
Figure 5: Training results for GRTS: (a) Attack Success Rate (ASR) visualized as a heatmap, showing the output payoff between blue teams and red teams from various iterations of GRTS. It shows decomposed results over three rounds of attack-defense interaction. Note that in the 1st round, prompts come from the training prompt dataset, so only the blue team varies. (b) further highlights the optimization pathways for both the red and blue teams; the z-axis is the average ASR in (a) over the 3 rounds.
...and 10 more figures
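
The five steps in the Figure 3 caption and the exploitability curve of Figure 4(a) can be sketched on a toy problem. This is a hedged illustration of a PSRO-style loop, not the GRTS implementation: it assumes a fixed zero-sum payoff matrix in place of LLM dialogues, an exact argmax in place of training a best-response LLM, and plain fictitious play as the meta-solver. The names `solve_meta_game`, `psro_sketch`, and `full_payoff` are hypothetical.

```python
import numpy as np

def solve_meta_game(payoff, steps=2000):
    """Approximate a Nash equilibrium of a zero-sum matrix game by fictitious play."""
    red = np.full(payoff.shape[0], 1.0 / payoff.shape[0])
    blue = np.full(payoff.shape[1], 1.0 / payoff.shape[1])
    for t in range(1, steps + 1):
        alpha = 1.0 / (t + 1)
        red_br = np.eye(len(red))[np.argmax(payoff @ blue)]     # red maximizes payoff
        blue_br = np.eye(len(blue))[np.argmin(red @ payoff)]    # blue minimizes it
        red = (1 - alpha) * red + alpha * red_br
        blue = (1 - alpha) * blue + alpha * blue_br
    return red, blue

def psro_sketch(full_payoff, iterations=5):
    """Grow red and blue populations with best responses to the current meta-strategy."""
    red_pop, blue_pop = [0], [0]   # step 1: one policy per team
    history = []                   # exploitability of the meta-strategy per iteration
    for _ in range(iterations):
        # Step 3: payoff matrix restricted to the current populations (the meta game).
        meta = full_payoff[np.ix_(red_pop, blue_pop)]
        # Step 4: approximate Nash equilibrium of the restricted meta game.
        red_mix, blue_mix = solve_meta_game(meta)
        # Step 2: value of every candidate policy against the opponent's meta-strategy.
        red_values = full_payoff[:, blue_pop] @ blue_mix
        blue_values = red_mix @ full_payoff[red_pop, :]
        history.append(float(red_values.max() - blue_values.min()))
        # Add each team's best response to its population if it is new.
        new_red, new_blue = int(np.argmax(red_values)), int(np.argmin(blue_values))
        if new_red not in red_pop:
            red_pop.append(new_red)
        if new_blue not in blue_pop:
            blue_pop.append(new_blue)
        # Step 5: the next iteration starts from the enlarged populations.
    return red_pop, blue_pop, history

if __name__ == "__main__":
    rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
    red_pop, blue_pop, history = psro_sketch(rps)
    print("red population:", red_pop)
    print("blue population:", blue_pop)
    print("exploitability per iteration:", history)
```

In this toy setting a single-policy population is highly exploitable, and exploitability only falls once the populations cover the cyclic strategies. That mirrors, in a much simpler game, the caption's point that the populations need to grow with new best responses before the restricted equilibrium is a good approximation.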

Theorems & Definitions (4)

Definition 1
Definition 2
Proposition 1
Proposition 1
