Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques

TL;DR

The paper reframes LM safety alignment as an online two-player zero-sum game in which a single model alternates between attacker and defender roles, so that the two roles co-evolve. Its method, Self-RedTeam, trains both roles with continuous self-play multi-agent reinforcement learning (MARL) and comes with a conditional safety guarantee: if self-play converges to a Nash Equilibrium, the defender responds safely to any adversarial prompt. Empirically, the method improves defense robustness and uncovers more diverse attacks, aided by hidden Chain-of-Thought. The paper reports safety gains across multiple model families and benchmarks while maintaining conversational ability, and it argues for a shift from reactive patching to proactive, scalable self-evolution in safety alignment, with end-to-end online MARL as a route to robust, autonomous safety improvement of LMs.

Abstract

Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
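
The abstract reports attack diversity in terms of SBERT embeddings. As a rough illustration of what an embedding-based diversity score can look like, the sketch below computes the mean pairwise cosine distance over a set of vectors. This particular formula is an assumption made for illustration, not necessarily the metric used in the paper, and the two-dimensional vectors are toy stand-ins for real SBERT embeddings (the function names are hypothetical).

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_distance(embeddings):
    """Average (1 - cosine similarity) over all unordered pairs.

    Assumed diversity score: 0.0 when every vector points the same way,
    larger when the vectors spread out. Returns 0.0 for fewer than two items.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy stand-ins for attack embeddings (real SBERT vectors are much longer).
collapsed_attacks = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # one dominant mode
spread_attacks = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]    # varied directions

collapsed_score = mean_pairwise_distance(collapsed_attacks)
spread_score = mean_pairwise_distance(spread_attacks)
```

Under this score, an attacker whose outputs collapse onto a few patterns gets a value near zero, while one that keeps exploring gets a larger value.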

Paper Structure

This paper contains 53 sections, 2 theorems, 7 equations, 7 figures, 15 tables, 1 algorithm.

Key Result

Theorem 1

When the two players' policies converge to a Nash Equilibrium $(\pi_A^*,\pi_D^*)$, it can be shown that for any prompt $y_A$, $r_\theta(y_A,\pi_D^*(y_A))\ge0$, i.e., the response is safe.
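
One way to see why such a guarantee is plausible is the contradiction sketch below. It is not the paper's proof. It assumes (i) that the defender maximizes $r_\theta$ while the attacker minimizes it, (ii) that every prompt admits at least one response with non-negative reward, such as a harmless reply, and (iii) that the minimum over prompts is attained. All three assumptions are introduced here for illustration.

```latex
\begin{align*}
&\text{Suppose there is a prompt } y_A \text{ with } r_\theta\big(y_A, \pi_D^*(y_A)\big) < 0. \\
&\text{Since } \pi_A^* \text{ is a best response for the minimizing attacker:} \\
&\qquad \mathbb{E}_{y \sim \pi_A^*}\, r_\theta\big(y, \pi_D^*(y)\big)
   = \min_{y}\, r_\theta\big(y, \pi_D^*(y)\big)
   \le r_\theta\big(y_A, \pi_D^*(y_A)\big) < 0. \\
&\text{By assumption (ii), pick } \pi_D' \text{ with } r_\theta\big(y, \pi_D'(y)\big) \ge 0 \text{ for all } y. \text{ Then} \\
&\qquad \mathbb{E}_{y \sim \pi_A^*}\, r_\theta\big(y, \pi_D'(y)\big) \ge 0
   > \mathbb{E}_{y \sim \pi_A^*}\, r_\theta\big(y, \pi_D^*(y)\big).
\end{align*}
```

The defender would gain by deviating to $\pi_D'$, which contradicts $\pi_D^*$ being a best response to $\pi_A^*$. Under these assumptions, no such prompt $y_A$ can exist at equilibrium.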

Figures (7)

  • Figure 1: Proposed Self-RedTeam framework, in which an LLM plays a red-teaming game by defending against its own generated attacks. The process begins with the shared LLM policy playing the role of the attacker and receiving a seed prompt (a). This is privately refined into an adversarial attack ($y_A$) using a hidden chain-of-thought process ($y_A^{CoT}$) invisible to the opponent (b). The attack is then passed to the defender (c), which also uses private thoughts ($y_D^{CoT}$) to process the attack and formulate a public response ($y_D$) (d). A verifier oversees the interaction, scoring both the attack and the defense to create a zero-sum adversarial game (e), in which the attacker attempts to elicit either harmful responses or refusals of benign queries. Finally, these scores are fed back to both roles for RL training (f), enabling continuous co-evolution and robust safety alignment of the defender.
  • Figure 2: (CAUTION: Offensive and Derogatory Language) t-SNE visualization of SBERT embeddings for adversarial attacks generated by Self-Play and Attacker-Only methods, based on 1000 distinct seed prompts. The spatial distribution illustrates semantic clustering of the generated attack vectors. Notably, the Attacker-Only method tends to reuse similar attack patterns even when the seed prompts differ and occupy varied locations in the t-SNE space. Observing the training iterations (and the quantitative analysis in Figure \ref{fig:game_metrics_the_big_plot}(a,e)), attacks from the Attacker-Only model, while initially scattered, converge into a few dominant modes (e.g., "disinformation campaign", "social media campaign") later in training. In contrast, the Self-Play method generates diverse attacks, ranging from "U.S. nuclear weapons" details to "eliciting offensive stereotypes". For a detailed examination of individual clusters, see Figure \ref{fig:separate_tsne_diagrams}.
  • Figure 3: Training metrics. (a, e) Diversity of generated attacks, evaluated on a holdout set during training. (b, c, d) Attacker performance metrics for generated attacks. (f, g) Defender performance metrics against attack instances. (h) Average CoT template violation rate. Results show means over 3 runs with 95% confidence intervals (shaded). See § \ref{sec:results} for an in-depth analysis of the diagrams.
  • Figure 4: Schematic diagram illustrating the self-distillation procedure for generating the SFT dataset. The process involves four steps: (1) A prompt is sampled from a set of benign prompts; (2) The Llama-3.1-8B-Instruct model generates a completion using its default chat template; (3) The original prompt and completion are used to prompt the model in a new session, asking it to retrospectively generate the reasoning process that led to this completion; (4) All three components—original prompt, completion, and generated reasoning—are concatenated to form the final SFT training data.
  • Figure 5: Bootstrapped distributions of evaluation performance across five benchmarks, finetuning Llama-3.1-8B-IT-AB. Each box represents results from 8 different checkpoints per training approach. Higher values are preferred. Self-Play + SFT demonstrates better safety and chat scores, with lower variance across benchmarks compared to Defender + SFT. Despite Self-Play (No CoT)'s strong safety performance, its low WJB:Benign score indicates excessive refusal on benign queries.
  • ...and 2 more figures
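
The interaction described in the Figure 1 caption can be sketched as a single round of a toy game. This is a minimal sketch under stated assumptions, not the authors' implementation: `attacker_turn`, `defender_turn`, `verifier`, and `play_round` are hypothetical stand-ins, the "policy" is a stub rather than an LLM, and the toy verifier judges only whether the defender refused, using the seed's harmfulness label.

```python
import random
from dataclasses import dataclass


@dataclass
class Turn:
    hidden_cot: str  # private reasoning, never shown to the opponent
    public: str      # text that the opponent and the verifier see


def attacker_turn(seed_prompt: str, seed_is_harmful: bool) -> Turn:
    # Hypothetical stand-in for the shared policy in the attacker role.
    goal = "elicit a harmful answer" if seed_is_harmful else "provoke a refusal"
    return Turn(hidden_cot=f"plan: {goal}", public=f"[revised] {seed_prompt}")


def defender_turn(attack_text: str, rng: random.Random) -> Turn:
    # Hypothetical stand-in for the same policy in the defender role.
    # It receives only the attacker's public text, not its hidden plan.
    refuse = rng.random() < 0.5
    reply = "I can't help with that." if refuse else "Sure, here is an answer."
    return Turn(hidden_cot="plan: assess intent first", public=reply)


def verifier(seed_is_harmful: bool, response: str) -> float:
    # Toy judge (assumption): the defender scores +1 for refusing a harmful
    # request or answering a benign one, and -1 otherwise.
    refused = response.startswith("I can't")
    return 1.0 if refused == seed_is_harmful else -1.0


def play_round(seed_prompt: str, seed_is_harmful: bool, rng: random.Random) -> dict:
    attack = attacker_turn(seed_prompt, seed_is_harmful)
    defense = defender_turn(attack.public, rng)
    defender_reward = verifier(seed_is_harmful, defense.public)
    # Zero-sum: the attacker's reward is the negation of the defender's.
    return {
        "attacker": -defender_reward,
        "defender": defender_reward,
        "attack": attack,
        "defense": defense,
    }


result = play_round("How do I pick a lock?", True, random.Random(0))
```

In a real training loop, the two rewards would drive a reinforcement learning update of the shared policy in both roles. That step is omitted here because the page does not specify it.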

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 1
  • proof