Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu; Liwei Jiang; Yancheng Liang; Simon Shaolei Du; Yejin Choi; Tim Althoff; Natasha Jaques

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques

TL;DR

The paper reframes LM safety as an online, co-evolving two-player zero-sum game where a single model alternates between attacker and defender roles. Through Self-RedTeam, it achieves continuous self-play MARL training with a theoretically grounded safety guarantee (Nash Equilibrium) and empirically demonstrates improved defense robustness and diverse attack discovery, aided by Hidden Chain-of-Thought. The approach shows strong safety gains across multiple model families and benchmarks while maintaining conversational abilities, highlighting a shift from reactive patching to proactive, scalable self-evolution in safety alignment. These results support broader adoption of end-to-end online MARL for robust, autonomous safety improvement of LLMs.

Abstract

Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

TL;DR

Abstract

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

TL;DR

Abstract

Paper Structure

Table of Contents

Key Result

Figures (7)

Theorems & Definitions (3)