Language Alignment via Nash-learning and Adaptive feedback
Ari Azarafrooz, Farshid Faal
TL;DR
This work addresses the challenge of aligning large language models without relying on a pre-learned human preference model or annotated datasets. It introduces LANA, a mirror-descent–style algorithm that leverages adaptive feedback from an improved opponent and self-evaluation to steer learning, with a proxy reward defined as $Q_i^{\tilde{\pi}_{-i,t}} = \log(\pi_{i,t}/\tilde{\pi}_{-i,t})/\gamma_t$ and a loss that reduces to $\mathbb{E}_{\pi_i} \log (\tilde{\pi}_{-i,t}/\pi_i)$. Experiments on 3K prompts with a small Phi-based model demonstrate improved performance on Alpaca evaluation and MT-Harness/GSM8K without human-annotated preferences, while ablations highlight the importance of base-model quality. The results suggest LANA can achieve self-alignment with robustness to noisy self-evaluation and without annotated data, offering a scalable path toward preference-free alignment. The work also provides a formal convergence discussion, outlining conditions under which average convergence to a Nash equilibrium can be achieved and identifying future directions for efficiency and scaling.
Abstract
Recent research has shown the potential of Nash Learning via Human Feedback for large language model alignment by incorporating the notion of a preference model in a minimax game setup. We take this idea further by casting alignment as a mirror descent algorithm against the adaptive feedback of an improved opponent, thereby removing the need to learn a preference model or to have an annotated dataset at all. The resulting algorithm, which we refer to as Language Alignment via Nash-learning and Adaptive feedback (LANA), is capable of self-alignment without a human-annotated preference dataset. We support this claim with experiments and a mathematical discussion.
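As a rough numerical illustration of the proxy reward and loss stated in the TL;DR, the sketch below evaluates both quantities from per-response log-probabilities. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `proxy_reward` and `lana_loss_estimate` are hypothetical, the log-probabilities are made-up inputs, and the expectation over $\pi_i$ is approximated by a plain average over responses assumed to be sampled from $\pi_i$.

```python
def proxy_reward(logp_policy, logp_opponent, gamma):
    """Proxy reward Q = log(pi_i / pi_tilde_{-i}) / gamma for one response.

    logp_policy   : log-probability of the response under the current policy pi_i
    logp_opponent : log-probability under the improved opponent pi_tilde_{-i}
    gamma         : step-size-like scalar gamma_t (assumed positive)
    """
    return (logp_policy - logp_opponent) / gamma


def lana_loss_estimate(logp_policy_samples, logp_opponent_samples):
    """Monte Carlo estimate of E_{pi_i}[log(pi_tilde_{-i} / pi_i)].

    Both lists hold log-probabilities of the same responses, which are
    assumed to have been sampled from pi_i (illustrative assumption).
    """
    diffs = [lo - lp for lp, lo in zip(logp_policy_samples, logp_opponent_samples)]
    return sum(diffs) / len(diffs)


# Hypothetical log-probabilities for two sampled responses.
logp_policy = [-1.0, -2.0]
logp_opponent = [-2.0, -2.0]
gamma = 0.5

rewards = [proxy_reward(lp, lo, gamma) for lp, lo in zip(logp_policy, logp_opponent)]
loss = lana_loss_estimate(logp_policy, logp_opponent)
```

With these inputs the rewards are 2.0 and 0.0 and the loss estimate is -0.5, which equals $-\gamma_t$ times the mean proxy reward. When both policies are proper distributions over the same responses, the exact expectation $\mathbb{E}_{\pi_i} \log(\tilde{\pi}_{-i,t}/\pi_i)$ is the negative KL divergence from $\pi_i$ to $\tilde{\pi}_{-i,t}$.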
