Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Yuheng Zhang; Dian Yu; Baolin Peng; Linfeng Song; Ye Tian; Mingyue Huo; Nan Jiang; Haitao Mi; Dong Yu

TL;DR

This work reframes LLM alignment as a general‑preferences two‑player game and introduces Iterative Nash Policy Optimization (INPO), an online no‑regret algorithm based on Online Mirror Descent to approximate the Nash policy via self‑play. By formulating a population loss that can be minimized directly on a preference dataset, INPO avoids estimating per‑response win rates and provides both sublinear regret and last‑iterate convergence guarantees. Empirically, INPO outperforms state‑of‑the‑art online RLHF methods on benchmarks like AlpacaEval 2.0 and Arena‑Hard, particularly when using a preference model as the oracle, and exhibits robust gains across academic benchmarks as well. The approach offers practical, scalable alignment with general human preferences and opens avenues for extending to finite‑sample analyses and full reinforcement learning settings.
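The self-play idea can be illustrated on a toy problem. The sketch below is not INPO itself: it is a tabular entropic mirror descent (multiplicative weights) update run against its own current policy, with no language model, no preference dataset, and no regularization toward a reference policy. The preference matrix `P`, the starting policy, `step_size`, and `num_rounds` are all hypothetical values chosen for illustration. The preferences are cyclic, so no scalar reward can rank the three responses, yet the averaged self-play policy approaches the Nash policy, which is uniform here by symmetry.

```python
import math

# Hypothetical 3-response preference matrix: P[i][j] is the probability that
# response i is preferred over response j, so P[i][j] + P[j][i] = 1.
# The preferences are cyclic (0 beats 1, 1 beats 2, 2 beats 0).
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]


def win_rates(policy):
    """Return the expected win rate of each response against `policy`."""
    return [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in P]


def self_play(initial_policy, step_size=0.05, num_rounds=2000):
    """Run multiplicative-weights self-play and return the averaged policy."""
    policy = list(initial_policy)
    average = [0.0] * len(policy)
    for _ in range(num_rounds):
        # Accumulate the uniform mixture of all iterates.
        average = [a + p / num_rounds for a, p in zip(average, policy)]
        # The opponent is the current policy itself.
        gains = win_rates(policy)
        # Mirror descent step with a KL regularizer: reweight, then normalize.
        weights = [p * math.exp(step_size * g) for p, g in zip(policy, gains)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return average


def exploitability(policy):
    """How far a best response can push its win rate above 1/2."""
    return max(win_rates(policy)) - 0.5


average_policy = self_play([0.8, 0.1, 0.1])
gap = exploitability(average_policy)
```

Because any policy wins against itself with probability exactly 1/2, the exploitability of the averaged policy equals the learner's average regret, which is why a no-regret update suffices. In this sketch `gap` should come out far below the starting policy's exploitability of 0.28.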

Abstract

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
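For reference, the BT assumption mentioned above is usually written as follows. This is the standard textbook form; the symbols $R$ (a scalar reward), $\sigma$ (the logistic function), $x$ (a prompt), and $y, y'$ (two responses) are notation assumed here and may differ from the paper's.

$$P(y \succ y' \mid x) = \sigma\big(R(x,y) - R(x,y')\big) = \frac{\exp(R(x,y))}{\exp(R(x,y)) + \exp(R(x,y'))}$$

Under this model, $P(y_1 \succ y_2 \mid x) > \tfrac{1}{2}$ and $P(y_2 \succ y_3 \mid x) > \tfrac{1}{2}$ imply $R(x,y_1) > R(x,y_2) > R(x,y_3)$, and hence $P(y_1 \succ y_3 \mid x) > \tfrac{1}{2}$. Preferences are therefore forced to be transitive, so a cyclic preference such as $y_1 \succ y_2 \succ y_3 \succ y_1$ cannot be represented by any reward $R$. A general preference oracle carries no such restriction.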

Paper Structure (35 sections, 6 theorems, 45 equations, 5 tables, 1 algorithm)

Introduction
Contributions.
Preliminaries
Notations.
General Preference Oracle.
RLHF with BT Model Assumption
Bradley-Terry (BT) Model Assumption.
Direct Preference Optimization (DPO).
RLHF with General Preferences
Nash Policy and Duality Gap.
Algorithm
Online Mirror Descent for Solving Nash Policy
Population Loss
Iterative Nash Policy Optimization Algorithm
Discussion
...and 20 more sections

Key Result

Lemma 2 (Regret Bound for OMD)

Under Assumption assum:bound_log, let $D=\max_{\pi \in \Pi} \mathrm{KL}(\pi \Vert \pi_1)$. Then the OMD algorithm in Eq. eq:omd_update with $\eta=\frac{\max(B\tau,1) \sqrt{T}}{\sqrt{D}}$ has the following guarantee:

Theorems & Definitions (12)

Definition 1: General Preference Oracle
Lemma 2: Regret Bound for OMD
Theorem 3: Duality Gap Bound for Uniform Mixture Policy in OMD
Theorem 4: Last-Iterate Convergence for OMD
Lemma 5
Proposition 6
proof
proof
Lemma 7: Lemma 2 in munos2023nash
proof
...and 2 more
