Language Alignment via Nash-learning and Adaptive feedback

Ari Azarafrooz; Farshid Faal

TL;DR

This work addresses the challenge of aligning large language models without relying on a pre-learned human preference model or annotated datasets. It introduces LANA, a mirror-descent-style algorithm that uses adaptive feedback from an improved opponent, together with self-evaluation, to steer learning. The proxy reward is defined as $Q_i^{\tilde{\pi}_{-i,t}} = \log(\pi_{i,t}/\tilde{\pi}_{-i,t})/\gamma_t$, and the loss reduces to $\mathbb{E}_{\pi_i} \log (\tilde{\pi}_{-i,t}/\pi_i)$. Experiments on 3K prompts with a small Phi-based model show improved performance on Alpaca evaluation and on MT-Bench and GSM8K without human-annotated preferences, while ablations highlight the importance of base-model quality. The results suggest that LANA can achieve self-alignment without annotated data and is robust to noisy self-evaluation, offering a scalable path toward preference-free alignment. The work also provides a formal convergence discussion, outlining conditions under which average convergence to a Nash equilibrium can be achieved, and identifies future directions for efficiency and scaling.
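To make the two expressions above concrete, here is a minimal sketch on a toy categorical distribution. It assumes strictly positive probabilities over a handful of outcomes; the function names `proxy_reward` and `expected_log_ratio`, the example distributions, and the value of `gamma` are illustrative assumptions and not taken from the paper, which works with language-model policies rather than small probability vectors.

```python
import math


def proxy_reward(pi, pi_tilde, gamma):
    """Per-outcome proxy reward log(pi / pi_tilde) / gamma.

    Hypothetical helper: pi and pi_tilde are lists of strictly positive
    probabilities over the same outcomes; gamma is a positive scalar.
    """
    return [math.log(p / q) / gamma for p, q in zip(pi, pi_tilde)]


def expected_log_ratio(pi, pi_tilde):
    """E_{y ~ pi}[log(pi_tilde(y) / pi(y))], which equals -KL(pi || pi_tilde)."""
    return sum(p * math.log(q / p) for p, q in zip(pi, pi_tilde))


# Toy example (values are assumptions chosen for illustration only).
pi = [0.5, 0.3, 0.2]          # current policy
pi_tilde = [0.6, 0.3, 0.1]    # "improved opponent" policy
gamma = 0.1

q_values = proxy_reward(pi, pi_tilde, gamma)
loss = expected_log_ratio(pi, pi_tilde)
```

On such a toy example the expected log-ratio is never positive and is zero exactly when the two distributions coincide, since it is the negative of a KL divergence. It also equals minus `gamma` times the expected proxy reward under `pi`, which ties the two formulas together.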

Abstract

Recent research has shown the potential of Nash Learning via Human Feedback for large language model alignment by incorporating the notion of a preference model in a minimax game setup. We take this idea further by casting the alignment as a mirror descent algorithm against the adaptive feedback of an improved opponent, thereby removing the need for learning a preference model or the existence of an annotated dataset altogether. The resulting algorithm, which we refer to as Language Alignment via Nash-learning and Adaptive feedback (LANA), is capable of self-alignment without the need for a human-annotated preference dataset. We support this statement with various experiments and mathematical discussion.

Paper Structure

This paper contains 15 sections, 1 theorem, 13 equations, 3 figures, 2 tables, and 1 algorithm.

Introduction
Language Alignment via Nash-learning and Adaptive feedback
Sampling from improved Opponent $\tilde{\pi}$
Algorithm
Experiments
Experiment Setup
Data
Base model
Hyperparameters
Results
Ablation study
Mathematical Discussions
LANA loss derivation
Convergence
Future works

Key Result

Lemma 5.1

(munos2020fast) Let $p\geq 1$ and $q\geq 1$ be such that $1/p+1/q=1$. Let $\phi$ be a strongly convex function with respect to the $\ell_p$-norm $\|\cdot \|_p$ with some modulus $\sigma$, i.e., for any $\pi,\pi'$, [...]. Write $D_\phi$ for the associated Bregman divergence: for $\pi,\pi'$, [...]. Let $\delta$ be a vector of dimension $|\mathcal{Y}|$. Define $\pi_{t+1}$ as [...]. Then for any $\pi\in \Delta(\mathcal{Y})$, we [...]
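For reference, the two definitions the lemma invokes are usually written as follows. These are the standard textbook forms, given here as an assumption about what the lemma's definitions look like; the paper's exact statement and notation may differ, and the update rule for $\pi_{t+1}$ and the lemma's conclusion are not reproduced.

```latex
% Strong convexity of \phi with respect to \|\cdot\|_p with modulus \sigma:
% for any \pi, \pi',
\phi(\pi) \;\geq\; \phi(\pi') + \langle \nabla\phi(\pi'),\, \pi - \pi' \rangle
          + \frac{\sigma}{2}\,\|\pi - \pi'\|_p^2 .

% Bregman divergence associated with \phi: for any \pi, \pi',
D_\phi(\pi, \pi') \;=\; \phi(\pi) - \phi(\pi') - \langle \nabla\phi(\pi'),\, \pi - \pi' \rangle .
```

Combining the two gives $D_\phi(\pi,\pi') \geq \frac{\sigma}{2}\|\pi-\pi'\|_p^2$. A common instance is the negative entropy $\phi(\pi)=\sum_y \pi(y)\log\pi(y)$ on the simplex, which is 1-strongly convex with respect to the $\ell_1$-norm and whose Bregman divergence is the KL divergence.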

Figures (3)

Figure 1: Preference datasets are pre-processed versions of the UltraFeedback test dataset (cui2023ultrafeedback) and different categories in open_hermes_preferences.
Figure 2: MT-Bench: Unlike other self-rewarding LMs, not only is there no drop in reasoning tasks, but there is also a significant increase in GSM8K, matching that of GPT-3.5-turbo. In other tasks, performance appears to be only slightly affected.
Figure 3: LANA provides no benefit with Mistral-v1 as the foundation model. Preference datasets are pre-processed versions of the UltraFeedback test dataset (cui2023ultrafeedback) and different categories in open_hermes_preferences.

Theorems & Definitions (1)

Lemma 5.1
