COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences

Yixin Liu, Argyris Oikonomou, Weiqiang Zheng, Yang Cai, Arman Cohan

TL;DR

COMAL addresses the limitations of Bradley-Terry-based alignment by formulating general-preference LLM alignment as a symmetric two-player zero-sum game with Nash equilibrium guarantees. It introduces a convergent meta-algorithm that iteratively solves KL-regularized subgames via a prox-like update and then moves the reference policy to the subgame solution, which yields last-iterate convergence to the Nash policy of the unregularized game. Theoretical results establish that the KL divergence to the Nash policy decreases monotonically and that the last iterate converges, while experiments show win-rate improvements on synthetic tests and real-world LLMs, with win rates above $60.2\%$ for Llama-3-8B-Instruct and above $56.8\%$ for Qwen2.5-7B against all compared baselines. The approach can be combined with existing preference-optimization methods with minimal changes.

Abstract

Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is not always sufficient to capture the full range and complexity of general human preferences. We explore RLHF under a general preference framework by modeling the alignment problem as a two-player zero-sum game in a game-theoretic framework, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous self-play algorithms for finding the Nash policy either diverge or only converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. We provide theoretical analysis that our meta-algorithm converges to an exact Nash policy in the last iterate and demonstrate its effectiveness on a range of synthetic and preference optimization datasets. COMAL is simple and can be integrated with many existing methods designed for preference optimization with minimal changes, and empirically it consistently maintains above 60.2% and 56.8% win rates, when applied to Llama-3-8B-Instruct and Qwen2.5-7B, against all compared algorithms under controlled evaluations.

Paper Structure

This paper contains 60 sections, 11 theorems, 62 equations, 5 figures, 6 tables, 11 algorithms.

Key Result

Theorem 1

Assume that there exists a Nash equilibrium $\pi^\star$ of the game objective $J(\pi_1,\pi_2)$ such that $\mathrm{supp}(\pi^\star) = \mathrm{supp}(\pi_{\mathrm{init}})$. In every iteration $t \ge 1$, it holds that $\mathrm{KL}(\pi^\star \,\|\, \pi^{t+1}) \le \ldots$

Figures (5)

  • Figure 1: (a) Convergence behavior of five methods in the synthetic experiments; (b) win-rate comparison with Llama-3 in the LLM experiments.
  • Figure 2: Comparisons of Iterative IPO (Iter-IPO), INPO, and COMAL. The average win rates of the trained checkpoints at each iteration against each training algorithm are displayed.
  • Figure 3: Dynamics on a simple $3$-dimensional preference game. The unique Nash equilibrium, $[4/11, 3/11, 4/11]$, is shown as a red star. All algorithms are initialized at the blue dot, $[0.2, 0.5, 0.3]$.
  • Figure 4: Dynamics on a simple $3$-dimensional preference game. The unique Nash equilibrium, $[4/11, 3/11, 4/11]$, is shown as a red star. All algorithms are initialized at the blue dot, $[0.2, 0.5, 0.3]$.
  • Figure 5: Comparisons of Iterative IPO (Iter-IPO), INPO, and COMAL. The average win rates of the trained checkpoints against the best checkpoints of each training algorithm, and the average lengths of the outputs are compared. For INPO, two variations with a small regularization ($\eta^{-1}=0.002$, INPO-Small) and a large regularization ($\eta^{-1}=0.01$, INPO-Large) are compared.
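
The synthetic setting of Figures 3 and 4 can be imitated with a small sketch of the meta-algorithm described in the TL;DR: repeatedly solve a KL-regularized game around a reference policy, then move the reference policy to the solution. Everything named here is an assumption made for illustration. The preference matrix `P` is a hypothetical one constructed so that its unique Nash equilibrium is $[4/11, 3/11, 4/11]$; it is not taken from the paper. The inner solver is a plain self-play mirror-descent loop chosen for brevity, and the values of `tau`, `eta`, and the iteration counts are arbitrary.

```python
import numpy as np

# Hypothetical preference matrix (NOT from the paper): P[i, j] is the
# probability that response i is preferred over response j. It is built so
# that the unique Nash equilibrium is [4/11, 3/11, 4/11].
P = np.array([
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.9],
    [0.8, 0.1, 0.5],
])
NASH = np.array([4.0, 3.0, 4.0]) / 11.0


def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))


def solve_regularized_game(pi_ref, tau, eta=0.5, steps=300):
    """Approximate the symmetric Nash policy of the KL-regularized game
    J(pi1, pi2) - tau * KL(pi1 || pi_ref) + tau * KL(pi2 || pi_ref)
    using self-play mirror descent (an assumed inner solver)."""
    log_ref = np.log(pi_ref)
    log_pi = log_ref.copy()
    for _ in range(steps):
        # Win rate of each response against the current policy.
        win = P @ np.exp(log_pi)
        # Mirror-descent step: pull toward pi_ref and toward winning responses.
        log_pi = (1.0 - eta * tau) * log_pi + eta * tau * log_ref + eta * win
        # Renormalize so the policy sums to one.
        log_pi -= np.log(np.sum(np.exp(log_pi)))
    return np.exp(log_pi)


def comal(pi_init, tau=0.2, outer_iters=60):
    """Outer loop: solve the regularized game, then update the reference."""
    pi = pi_init.copy()
    kls = [kl(NASH, pi)]
    for _ in range(outer_iters):
        pi = solve_regularized_game(pi_ref=pi, tau=tau)
        kls.append(kl(NASH, pi))  # tracked to inspect the KL to the Nash policy
    return pi, kls


pi_final, kls = comal(np.array([0.2, 0.5, 0.3]))
print("final policy:", pi_final)
print("win rate vs. each pure response:", pi_final @ P)
```

Under these assumptions, the recorded `kls` sequence gives a way to inspect the monotone KL decrease stated in Theorem 1, and `pi_final @ P` gives the win rate of the final policy against each single response, which should sit near 50% at a Nash policy.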

Theorems & Definitions (22)

  • Definition 1: General Preference Model
  • Theorem 1
  • Theorem 2
  • Definition 3: Bradley-Terry Model
  • Lemma 1
  • Corollary 1
  • Lemma 2
  • proof
  • Corollary 2
  • proof
  • ...and 12 more
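
A short worked example may help show why the Bradley-Terry model (Definition 3) cannot capture every general preference (Definition 1). The notation below is the standard form of the Bradley-Terry model and may differ from the paper's; the three responses $a$, $b$, $c$ are hypothetical.

```latex
% Bradley-Terry: preferences are induced by a scalar reward r.
\mathbb{P}(y \succ y') = \sigma\big(r(y) - r(y')\big),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}.

% A cyclic (non-transitive) preference over three responses a, b, c:
\mathbb{P}(a \succ b) > \tfrac{1}{2}, \quad
\mathbb{P}(b \succ c) > \tfrac{1}{2}, \quad
\mathbb{P}(c \succ a) > \tfrac{1}{2}.

% Since \sigma(x) > 1/2 if and only if x > 0, Bradley-Terry would require
r(a) > r(b) > r(c) > r(a),
% which is a contradiction: no reward function r represents this preference.
```

A general preference model places no such constraint on $\mathbb{P}$, which is why the alignment problem is posed as a game rather than as reward maximization.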