Magnetic Preference Optimization: Achieving Last-iterate Convergence for Language Model Alignment

Mingzhi Wang, Chengdong Ma, Qizhi Chen, Linjian Meng, Yang Han, Jiancong Xiao, Zhaowei Zhang, Jing Huo, Weijie J. Su, Yaodong Yang

TL;DR

This work tackles the misalignment risks in RLHF by moving beyond Bradley-Terry (BT) assumptions and average-iterate convergence. It introduces Magnetic Preference Optimization (MPO), which uses Magnetic Mirror Descent (MMD) to achieve last-iterate convergence to the Nash equilibrium (NE) of the original two-player constant-sum game. The authors establish a two-stage convergence framework: first, linear last-iterate convergence to the NE of a regularized game; second, iterative magnet updates that drive these regularized NEs toward the NE of the original game, so that a single final model reflects authentic human preferences. They provide a practical RLHF implementation with token-level MMD, REINFORCE-based advantages, and sequential KL tracking, alongside MPO-RT variants that integrate KL effects directly into rewards. Empirical results on safety alignment and general capability benchmarks show MPO improving over baselines and support the value of self-play in robust LLM alignment, with ablations confirming the importance of periodically updating the reference policy. Overall, MPO offers a scalable, theoretically grounded, and practical approach to aligning LLMs with diverse human preferences while avoiding the storage costs of average-iterate methods and the misalignment of regularized-game solutions.

Abstract

Self-play methods have demonstrated remarkable success in enhancing model capabilities across various domains. In the context of Reinforcement Learning from Human Feedback (RLHF), self-play not only boosts Large Language Model (LLM) performance but also overcomes the limitations of traditional Bradley-Terry (BT) model assumptions by finding the Nash equilibrium (NE) of a preference-based, two-player constant-sum game. However, existing methods either guarantee only average-iterate convergence, incurring high storage and inference costs, or converge to the NE of a regularized game, failing to accurately reflect true human preferences. In this paper, we introduce Magnetic Preference Optimization (MPO), a novel approach capable of achieving last-iterate convergence to the NE of the original game, effectively overcoming the limitations of existing methods. Building upon Magnetic Mirror Descent (MMD), MPO attains a linear convergence rate, making it particularly suitable for fine-tuning LLMs. To ensure our algorithm is both theoretically sound and practically viable, we present a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. Empirical results demonstrate that MPO can significantly enhance the performance of LLMs, highlighting the potential of self-play methods in alignment.

Paper Structure

This paper contains 39 sections, 14 theorems, 83 equations, 12 figures, 10 tables, 2 algorithms.

Key Result

Theorem 3.2

Consider the MMD update rule in equation eq:mmd. Assume $\pi^{k+1} \in \operatorname{int} \operatorname{dom} \psi$, $\Pi$ is bounded, $F$ is monotone and $L$-smooth with respect to $\|\cdot\|$, and $g$ is 1-strongly convex relative to $\psi$ over $\Pi$ with $g$ differentiable over $\operatorname{int} \operatorname{dom} \psi$ ... where $\alpha>0$ is the regularization temperature and $\eta>0$ is the learning rate.
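To make the roles of the magnet, $\alpha$, and $\eta$ concrete, here is a minimal sketch of MMD self-play on a small preference game. It is not the paper's token-level implementation. It assumes the standard negative-entropy instance of MMD, where the regularizer $g$ is the KL divergence to a magnet policy $\rho$ and the update has the closed form $\pi^{k+1} \propto [\pi^k \rho^{\eta\alpha} e^{\eta q}]^{1/(1+\eta\alpha)}$. The preference matrix `P`, the reference policy, the hyperparameters, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical non-transitive preference matrix: P[i, j] is the probability
# that response i is preferred over response j, so P + P.T == 1 everywhere.
# Its symmetric Nash equilibrium is (0.3, 0.3, 0.4), where every response
# wins with probability exactly 0.5 against the equilibrium policy.
P = np.array([[0.5, 0.9, 0.2],
              [0.1, 0.5, 0.8],
              [0.8, 0.2, 0.5]])

def mmd_step(pi, magnet, q, eta, alpha):
    """One MMD step on the simplex (negative-entropy mirror map, KL-to-magnet
    regularizer), computed in log space for numerical stability."""
    logits = (np.log(pi) + eta * alpha * np.log(magnet) + eta * q) / (1.0 + eta * alpha)
    logits -= logits.max()
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

def exploitability(pi, P):
    """Best-response win rate against pi, minus the 0.5 value of the game."""
    return float((P @ pi).max() - 0.5)

def self_play(P, pi_ref, eta, alpha, outer_iters, inner_steps, update_magnet):
    """Symmetric self-play: the policy plays against a copy of itself.
    If update_magnet is True, the magnet is reset to the current policy
    after each inner loop (the second stage described in the TL;DR)."""
    pi, magnet = pi_ref.copy(), pi_ref.copy()
    for _ in range(outer_iters):
        for _ in range(inner_steps):
            q = P @ pi  # win rate of each response against the current policy
            pi = mmd_step(pi, magnet, q, eta, alpha)
        if update_magnet:
            magnet = pi.copy()
    return pi

pi_ref = np.array([0.6, 0.3, 0.1])  # assumed starting / reference policy
settings = dict(eta=0.2, alpha=0.2, outer_iters=100, inner_steps=200)

pi_fixed = self_play(P, pi_ref, update_magnet=False, **settings)
pi_mpo = self_play(P, pi_ref, update_magnet=True, **settings)

print("fixed magnet  :", pi_fixed, exploitability(pi_fixed, P))
print("updated magnet:", pi_mpo, exploitability(pi_mpo, P))
```

With the magnet held fixed, the last iterate settles at the equilibrium of the KL-regularized game, which stays biased toward the reference policy and remains exploitable in the original game. Resetting the magnet to the current policy moves that regularized equilibrium step by step toward the equilibrium of the original game, which is the behavior the two-stage framework relies on.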

Figures (12)

  • Figure 1: Kuhn Poker Experiments.
  • Figure 2: An illustration of MPO and its background. Non-transitive preferences are prevalent across diverse populations, necessitating a more generalized preference model that frames the alignment problem as a two-player constant-sum game. Existing methods either converge to the NE of a regularized game or require maintaining multiple models. In contrast, MPO achieves last-iterate convergence to the original NE, aligning with diverse human preferences using only a single model.
  • Figure 3: MD and MMD.
  • Figure 4: MPO demonstrates a steady improvement in win rates across three iterations. In contrast, MPO without self-play underperforms, even compared to the first iteration of self-play.
  • Figure 5: Ablation study.
  • ...and 7 more figures

Theorems & Definitions (30)

  • Definition 2.1: Average-Iterate Convergence
  • Definition 3.1: Last-Iterate Convergence
  • Theorem 3.2: Theorem 3.4, sokota2022unified
  • Lemma 3.3
  • Theorem 3.4
  • Theorem 3.5: Proposition D.7, sokota2022unified
  • Lemma 3.6
  • Theorem 3.7
  • Lemma E.1: Proposition 1, munos2023nash
  • proof
  • ...and 20 more