COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
Yixin Liu, Argyris Oikonomou, Weiqiang Zheng, Yang Cai, Arman Cohan
TL;DR
COMAL addresses the limitations of Bradley-Terry-based alignment by formulating LLM alignment under general preferences as a symmetric two-player zero-sum game, whose Nash equilibrium policy is the alignment target. It introduces a meta-algorithm that repeatedly solves a KL-regularized subgame via a prox-like update and then replaces the reference policy with the subgame's solution, which yields last-iterate convergence to the Nash policy of the original, unregularized game. The theoretical results establish a monotone decrease in KL divergence and last-iterate convergence. Empirically, COMAL improves win rates in a synthetic setting and on real LLMs, maintaining win rates of $>60.2\%$ for Llama-3-8B-Instruct and $>56.8\%$ for Qwen2.5-7B against all compared algorithms. The approach can be integrated with existing preference-optimization methods with minimal changes.
Abstract
Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is not always sufficient to capture the full range and complexity of general human preferences. We explore RLHF under a general preference framework by modeling the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees at least a 50% win rate against any competing policy. However, even in a simple synthetic setting, previous self-play algorithms for finding the Nash policy either diverge or converge only to the Nash policy of a modified game, and therefore fail to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. We provide a theoretical analysis showing that our meta-algorithm converges to an exact Nash policy in the last iterate, and we demonstrate its effectiveness on a range of synthetic and preference optimization datasets. COMAL is simple and can be integrated with many existing methods designed for preference optimization with minimal changes. Empirically, it consistently maintains win rates above 60.2% and 56.8% when applied to Llama-3-8B-Instruct and Qwen2.5-7B, respectively, against all compared algorithms under controlled evaluations.
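The two-level structure described above (solve a KL-regularized game, then move the reference policy to its solution) can be illustrated on a toy tabular problem. The following is a minimal sketch under stated assumptions, not the authors' implementation: the 3-response preference matrix `P` is hypothetical, the policy is a plain probability vector rather than an LLM, the inner solver is a generic regularized mirror-descent (prox-style) step chosen for illustration, and the values of `tau`, `eta`, and the step counts are arbitrary. The preferences in `P` are cyclic, so no Bradley-Terry reward can represent them, and the exact Nash policy of this matrix is (2/9, 1/3, 4/9), which can be checked by hand.

```python
import math

# Hypothetical preference matrix (an assumption for illustration only):
# P[i][j] = probability that response i is preferred over response j.
# Preferences are cyclic (0 beats 1, 1 beats 2, 2 beats 0).
P = [
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.7],
    [0.8, 0.3, 0.5],
]


def win_rates(policy):
    """Win probability of each pure response against `policy`."""
    n = len(policy)
    return [sum(P[i][j] * policy[j] for j in range(n)) for i in range(n)]


def worst_case_win_rate(policy):
    """Win rate of `policy` against a best-responding opponent."""
    # The opponent's best pure response wins with probability max_i (P policy)_i.
    return 1.0 - max(win_rates(policy))


def solve_regularized_game(ref, tau, eta=0.5, steps=300):
    """Approximate the Nash policy of the KL-regularized game (illustrative solver).

    Each step maximizes, in closed form,
        eta * <wins, pi> - eta * tau * KL(pi || ref) - KL(pi || current policy).
    """
    policy = list(ref)  # warm start from the reference policy
    for _ in range(steps):
        wins = win_rates(policy)
        logits = [
            (math.log(p) + eta * tau * math.log(r) + eta * w) / (1.0 + eta * tau)
            for p, r, w in zip(policy, ref, wins)
        ]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        policy = [w / total for w in weights]
    return policy


def comal(init, tau=0.1, outer_steps=50):
    """Meta-loop: solve the regularized game, then reset the reference to its solution."""
    ref = list(init)
    for _ in range(outer_steps):
        ref = solve_regularized_game(ref, tau)
    return ref


uniform = [1.0 / 3.0] * 3
# Fixed reference: converges to the Nash policy of the regularized (modified) game.
fixed_ref_policy = solve_regularized_game(uniform, tau=0.1, steps=3000)
# Updated reference: the last iterate approaches the Nash policy of the original game.
comal_policy = comal(uniform)

print("fixed reference:", fixed_ref_policy, worst_case_win_rate(fixed_ref_policy))
print("COMAL-style    :", comal_policy, worst_case_win_rate(comal_policy))
```

Comparing the two printed worst-case win rates shows the distinction drawn in the abstract: with a fixed reference, the solver settles on the equilibrium of the modified game and can be exploited, whereas updating the reference removes the bias introduced by the regularizer.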
