Accelerating Nash Learning from Human Feedback via Mirror Prox
Daniil Tiapkin, Daniele Calandriello, Denis Belomestny, Eric Moulines, Alexey Naumov, Kashif Rasul, Michal Valko, Pierre Menard
TL;DR
This work reframes human feedback learning as a symmetric two-player preference game and introduces Nash Mirror Prox (Nash-MP) to compute the $β$-regularized Nash equilibrium with strong convergence guarantees. Nash-MP achieves last-iterate linear convergence in KL-divergence and suboptimality, with rates that scale as $(1+2β)^{-N/2}/β$ and are independent of action-space size, outperforming prior NLHF methods. An approximate, policy-gradient-based variant makes the method practical for deep-learning settings such as LLM fine-tuning, with theoretical guarantees linking approximation error to remaining suboptimality. Empirically, Nash-MP demonstrates competitive performance in matrix games and superior results in LLM alignment tasks compared to several baselines, supporting its potential to improve human-preference alignment while maintaining proximity to reference policies.
Abstract
Traditional Reinforcement Learning from Human Feedback (RLHF) often relies on reward models, frequently assuming preference structures like the Bradley-Terry model, which may not accurately capture the complexities of real human preferences (e.g., intransitivity). Nash Learning from Human Feedback (NLHF) offers a more direct alternative by framing the problem as finding a Nash equilibrium of a game defined by these preferences. In this work, we introduce Nash Mirror Prox ($\mathtt{Nash-MP}$), an online NLHF algorithm that leverages the Mirror Prox optimization scheme to achieve fast and stable convergence to the Nash equilibrium. Our theoretical analysis establishes that Nash-MP exhibits last-iterate linear convergence towards the $β$-regularized Nash equilibrium. Specifically, we prove that the KL-divergence to the optimal policy decreases at a rate of order $(1+2β)^{-N/2}$, where $N$ is the number of preference queries. We further demonstrate last-iterate linear convergence for the exploitability gap and uniformly for the span semi-norm of log-probabilities, with all these rates being independent of the size of the action space. Furthermore, we propose and analyze an approximate version of Nash-MP where proximal steps are estimated using stochastic policy gradients, bringing the algorithm closer to practical applications. Finally, we detail a practical implementation strategy for fine-tuning large language models and present experiments that demonstrate its competitive performance and compatibility with existing methods.
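To make the Mirror Prox idea concrete, the following is a minimal sketch on a small matrix game, not the paper's exact algorithm. It assumes a tabular preference matrix `P` (with `P[i][j]` the probability that action `i` is preferred over action `j`), a generic KL-proximal step whose closed form is derived here from the objective $-\mathcal{P}(π \succ π') + β\,\mathrm{KL}(π \| π_{\mathrm{ref}})$, and a step size `eta`. The function names, the step size, and the rock-paper-scissors example are illustrative assumptions. Each iteration performs an extrapolation step followed by an update step, both anchored at the current policy, so it uses two preference evaluations.

```python
import math


def normalize_log(logits):
    """Softmax computed stably from unnormalized log-weights."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def win_prob(P, opponent):
    """Probability that each action is preferred over a draw from `opponent`."""
    n = len(opponent)
    return [sum(P[i][j] * opponent[j] for j in range(n)) for i in range(len(P))]


def prox_step(anchor, ref, payoff, eta, beta):
    """Closed-form minimizer over the simplex of
    eta * (-<pi, payoff> + beta * KL(pi || ref)) + KL(pi || anchor).
    This form is an assumption of the sketch, derived from first-order conditions."""
    logits = [
        (math.log(a) + eta * beta * math.log(r) + eta * g) / (1.0 + eta * beta)
        for a, r, g in zip(anchor, ref, payoff)
    ]
    return normalize_log(logits)


def mirror_prox(P, ref, init, beta=0.1, eta=1.0, num_iters=200):
    """Self-play Mirror Prox on the symmetric regularized preference game."""
    pi = list(init)
    for _ in range(num_iters):
        # Extrapolation: respond to the current policy.
        half = prox_step(pi, ref, win_prob(P, pi), eta, beta)
        # Update: respond to the extrapolated policy, still anchored at pi.
        pi = prox_step(pi, ref, win_prob(P, half), eta, beta)
    return pi


# Intransitive preferences (rock-paper-scissors): no Bradley-Terry reward fits them.
P_RPS = [
    [0.5, 0.0, 1.0],  # rock beats scissors
    [1.0, 0.5, 0.0],  # paper beats rock
    [0.0, 1.0, 0.5],  # scissors beats paper
]
```

Calling `mirror_prox(P_RPS, ref=[1/3, 1/3, 1/3], init=[0.7, 0.2, 0.1])` should drive the last iterate toward the uniform policy, which is the regularized equilibrium of this game by symmetry when the reference policy is uniform. Both proximal steps are anchored at the current policy rather than at the extrapolated one; this is what distinguishes Mirror Prox from plain mirror descent.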
