Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games
Chengdong Ma, Ziran Yang, Hai Ci, Jun Gao, Minquan Gao, Xuehai Pan, Yaodong Yang
TL;DR
This work tackles safety vulnerabilities in large language models by reframing red-teaming as a dynamic, multi-round, multi-agent game, the Red Team Game (RTG). It introduces the Gamified Red Team Solver (GRTS), a population-based, PSRO-style method that expands the red and blue policy sets and optimizes meta-strategies to converge toward an approximate Nash equilibrium, while maintaining semantic diversity to mitigate mode collapse. The authors prove convergence via generalized fictitious play and validate the approach empirically, showing improved attack diversity, stronger defenses, and safer alignment across multiple open-source models. The study also reveals a spinning-top geometric structure in RTG, which underscores the need for diverse attacker populations to robustly assess and improve LLM safety and toxicity detection, with practical implications for scalable alignment and threat discovery.
Abstract
The primary challenge in deploying Large Language Models (LLMs) is ensuring their harmlessness. Red teams can identify vulnerabilities by attacking LLMs, which helps improve their safety. However, current efforts rely heavily on single-round prompt designs and on unilateral red-team optimization against fixed blue teams. These static approaches significantly reduce generation diversity, a phenomenon known as mode collapse, which makes it difficult to discover potential risks in increasingly complex human-LLM interactions. Here we introduce the dynamic Red Team Game (RTG) to comprehensively analyze the multi-round offensive and defensive interactions between the red team and the blue team. Furthermore, we develop a Gamified Red Team Solver (GRTS) with diversity measures to mitigate mode collapse and to theoretically guarantee convergence to an approximate Nash equilibrium, which results in better strategies for both teams. Empirical results demonstrate that GRTS explores diverse and implicit attacks to adaptively exploit various LLMs, surpassing the constraints of specific modes. Insightfully, the geometrical structure of the red-team task that we unveil aligns with the spinning top hypothesis, confirming the necessity of constructing a diverse LLM population as a promising proxy for heterogeneous human expert red-teamers. This paves the way for scalable toxicity detection and safe alignment for LLMs.
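
To give a feel for what convergence to an approximate Nash equilibrium means in a two-team zero-sum setting, the following is a minimal sketch of plain fictitious play on a toy matrix game. It is not GRTS: there are no language models, no policy populations, and no diversity measure. The payoff matrix (rock-paper-scissors standing in for red-versus-blue payoffs) and the helper names `best_response`, `fictitious_play`, and `exploitability` are assumptions made for illustration only.

```python
def best_response(values):
    """Return the index of the largest value (ties go to the lowest index)."""
    return max(range(len(values)), key=lambda i: values[i])


def fictitious_play(A, iterations=10000):
    """Run simultaneous fictitious play on a zero-sum matrix game.

    A[i][j] is the (hypothetical) payoff to the red/row player when red
    plays action i and blue plays action j; blue receives -A[i][j].
    Returns the empirical mixed strategies (x for red, y for blue).
    """
    n_rows, n_cols = len(A), len(A[0])
    row_counts = [0] * n_rows  # how often red has played each action
    col_counts = [0] * n_cols  # how often blue has played each action

    for _ in range(iterations):
        # Red evaluates each action against blue's empirical play so far.
        row_values = [
            sum(A[i][j] * col_counts[j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        # Blue does the same against red, with the sign flipped (zero-sum).
        col_values = [
            -sum(A[i][j] * row_counts[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Both players best-respond simultaneously to the observed history.
        row_counts[best_response(row_values)] += 1
        col_counts[best_response(col_values)] += 1

    x = [c / iterations for c in row_counts]
    y = [c / iterations for c in col_counts]
    return x, y


def exploitability(A, x, y):
    """Total gain available to both players from deviating unilaterally.

    This is zero exactly at a Nash equilibrium of the zero-sum game, so a
    small value indicates an approximate equilibrium.
    """
    n_rows, n_cols = len(A), len(A[0])
    red_best = max(
        sum(A[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)
    )
    blue_best = min(
        sum(A[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)
    )
    return red_best - blue_best


# Toy payoff matrix: rock-paper-scissors from the row player's perspective.
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

if __name__ == "__main__":
    x, y = fictitious_play(RPS, iterations=10000)
    print("red strategy: ", [round(p, 3) for p in x])
    print("blue strategy:", [round(p, 3) for p in y])
    print("exploitability:", round(exploitability(RPS, x, y), 4))
```

In this toy game the empirical strategies drift toward the uniform mixture and the exploitability shrinks as the number of iterations grows. The cyclic structure of rock-paper-scissors also hints at why a single fixed attacker is insufficient: any pure strategy is beaten by some response, so only a mixture over several distinct strategies is unexploitable.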
