Toward Optimal LLM Alignments Using Two-Player Games

Rui Zheng; Hongyi Guo; Zhihan Liu; Xiaoying Zhang; Yuanshun Yao; Xiaojun Xu; Zhaoran Wang; Zhiheng Xi; Tao Gui; Qi Zhang; Xuanjing Huang; Hang Li; Yang Liu

TL;DR

The paper reframes LLM alignment as a two-player game between an adversarial prompt generator and a defensive responder, introducing Game-theoretical Preference Optimization (GPO) and a diversity-aware prompt mechanism. It proves convergence to a Nash Equilibrium of the induced zero-sum game and provides an $O(T^{-1/2})$ bound on the Nash gap for averaged policies, including an entropy-regularized variant. Empirically, the approach yields improved safety relative to RLHF and stronger, more diverse jailbreak prompts, demonstrating enhanced generalization for both agents across safety and jailbreak tasks. The work highlights continuous red-teaming as a practical, adaptive partner for defensive LLM training and outlines avenues to extend the framework to broader domains and combine it with complementary methods.

Abstract

The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include scenarios that LLMs need to improve on the most. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weaknesses of the defensive agent. In return, the defensive agent seeks to improve its responses to the newly identified prompts that it struggled with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.
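The alternating interaction described in the abstract can be pictured with a deliberately small toy model. The sketch below is not the paper's GPO procedure: it assumes a fixed set of three prompts, a hypothetical per-prompt `skill` score in [0, 1] standing in for the reward model's feedback, and a softmax adversary (`adversary_distribution`, a name introduced here) that concentrates on the prompts where the defender is weakest.

```python
import numpy as np

def adversary_distribution(skill, temperature=0.2):
    """Hypothetical adversary: a softmax over negative skill, so prompts on
    which the defender scores poorly receive more probability mass."""
    logits = -np.asarray(skill, dtype=float) / temperature
    logits -= logits.max()              # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def alternate_training(skill, rounds=50, lr=0.5):
    """Toy alternating loop (an assumption, not the paper's algorithm)."""
    skill = np.asarray(skill, dtype=float).copy()
    for _ in range(rounds):
        mu = adversary_distribution(skill)   # adversarial step: target weak prompts
        skill += lr * mu * (1.0 - skill)     # defensive step: improve where attacked
    return skill

initial_skill = np.array([0.9, 0.2, 0.6])    # made-up starting scores
final_skill = alternate_training(initial_skill)
print(final_skill)
```

Because the adversary keeps shifting its mass toward whichever prompt is currently weakest, the defender's improvement is spent where it is most needed, which is the intuition behind replacing a static prompt set with an adaptive one.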

Paper Structure

This paper contains 30 sections, 6 theorems, 50 equations, 3 figures, 3 tables, 3 algorithms.

Introduction
Preliminary
Game-theoretical Preference Optimization (GPO)
A two-agent game framework for alignment
Application of two-agent alignment in improving LLM safety
Safety rewards
Diversity rewards
Theoretical analysis
Experiments
Baselines
Main results
Analysis and discussion
Related Work
Conclusion, Limitation and Future Work
Theoretical Analysis
...and 15 more sections

Key Result

Theorem 3.2

By choosing proper parameters $\beta, \eta = \mathcal{O}(\sqrt{T})$, the average policies $\widehat{\pi}_T, \widehat{\mu}_T$ given by the theoretical version of the algorithm labeled alg:general-alg satisfy a Nash-gap bound of order $\mathcal{O}(T^{-1/2})$.
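To make the idea of a Nash gap for averaged policies concrete, here is a minimal sketch on a small zero-sum matrix game. It swaps in multiplicative-weights (Hedge) self-play, a standard no-regret method, and is not the paper's algorithm; the payoff matrix `A`, the function names, and the step size `eta` (a learning rate, not the paper's $\beta, \eta$ parameterization) are all assumptions made for illustration.

```python
import numpy as np

def nash_gap(A, x, y):
    """Nash gap of (x, y) when the row player maximizes x^T A y and the
    column player minimizes it: best-response gain of both players summed."""
    return float(np.max(A @ y) - np.min(x @ A))

def hedge_self_play(A, T, eta=None):
    """Run Hedge for both players and return the time-averaged policies."""
    n, m = A.shape
    if eta is None:
        eta = np.sqrt(np.log(max(n, m)) / T)   # step size of order T^{-1/2}
    x = np.full(n, 1.0 / n)
    y = np.full(m, 1.0 / m)
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(T):
        x_sum += x
        y_sum += y
        gain_x = A @ y            # payoff of each row action against y
        loss_y = x @ A            # loss of each column action against x
        x = x * np.exp(eta * gain_x)
        x /= x.sum()
        y = y * np.exp(-eta * loss_y)
        y /= y.sum()
    return x_sum / T, y_sum / T

rng = np.random.default_rng(0)
A = rng.uniform(-1.0, 1.0, size=(4, 4))        # made-up payoffs in [-1, 1]
for T in (100, 1000, 10000):
    x_bar, y_bar = hedge_self_play(A, T)
    print(T, nash_gap(A, x_bar, y_bar))
```

For payoffs in [-1, 1] and this step size, the standard Hedge regret bound gives a Nash gap of at most $3\sqrt{\ln n / T}$ for the averaged policies of an $n \times n$ game, which is the same $T^{-1/2}$ scaling as the rate quoted for the paper's result.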

Figures (3)

Figure 1: In our approach, we establish a dynamic learning environment where an adversarial agent evaluates the past mistakes and current performance of a defensive agent to pinpoint and exploit potential vulnerabilities. In response, the defensive agent continuously adapts and reinforces these identified weaknesses, thereby improving performance through this iterative process.
Figure 2: Impact of diversity rewards on our framework, with a blue background denoting training of the defensive agent and a red background denoting training of the adversarial agent. As shown in the safe-reward and unsafe-reward panels, the adversarial and defensive agents take turns driving the rewards during the two-player iterative training. The attack-success-rate panel shows the defensive capabilities of the defensive agent at different steps, illustrating that our method surpasses RLHF across various diversity reward intensities. However, selecting a moderate intensity is preferable.
Figure 3: The impact of temperature sampling on the alignment capabilities of various models shows that our method exhibits more stable performance compared to SFT.

Theorems & Definitions (7)

Definition 3.1: $\epsilon$-approximate Nash Equilibrium
Theorem 3.2
Lemma A.1
Lemma A.2
Theorem A.4
Lemma A.5: Equivalence of maximin and minimax objectives
Lemma A.6: Minimax theorem (Fan, 1953)
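For orientation, the notion named in Definition 3.1 is usually written as follows in a two-player zero-sum game. This is the textbook form under assumed notation, not a quotation of the paper's definition: $J(\pi, \mu)$ is a hypothetical symbol for the expected payoff, with the defensive policy $\pi$ taken as the maximizer and the adversarial policy $\mu$ as the minimizer, and the paper's own objective may include regularization terms not shown here.

```latex
% Assumed notation: J(\pi, \mu) is the expected payoff,
% \pi maximizes (defensive agent), \mu minimizes (adversarial agent).
\[
\mathrm{NashGap}(\widehat{\pi}, \widehat{\mu})
  \;=\; \max_{\pi} J(\pi, \widehat{\mu}) \;-\; \min_{\mu} J(\widehat{\pi}, \mu)
  \;\le\; \epsilon .
\]
```

The gap is always nonnegative, since $\max_{\pi} J(\pi, \widehat{\mu}) \ge J(\widehat{\pi}, \widehat{\mu}) \ge \min_{\mu} J(\widehat{\pi}, \mu)$, and setting $\epsilon = 0$ recovers an exact Nash Equilibrium in which neither agent can gain by deviating unilaterally.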
