Learn Your Reference Model for Real Good Alignment

Alexey Gorbatovski; Boris Shaposhnikov; Alexey Malakhov; Nikita Surnachev; Yaroslav Aksenov; Ian Maksimov; Nikita Balagansky; Daniil Gavrilov

TL;DR

This work tackles offline LLM alignment by addressing the reward overoptimization that arises when models drift from a reference policy. It introduces Trust Region (TR) alignment methods (TR-DPO, TR-IPO, and TR-KTO) that update the reference policy during training via soft or hard updates, mitigating overoptimization while allowing beneficial divergence. In experiments on task-specific datasets (Anthropic-HH, Reddit TL;DR) and general benchmarks (AlpacaEval 2, Arena-Hard) with Pythia and Llama3 models, TR methods yield higher human-centric (HC) metrics at comparable KL divergence and improve win rates over traditional offline methods. Analyses of KL dynamics, HC metrics, and gradient behavior support the claim that TR updates stabilize optimization and reduce reliance on out-of-distribution (OOD) data, though hyperparameter sensitivity and evaluation modality are noted as considerations for future work.

Abstract

Although offline methods for Large Language Model (LLM) alignment do not require a direct reward model, they remain susceptible to overoptimization. This issue arises when the trained model deviates excessively from the reference policy, leading to a decrease in sample quality. We propose a new paradigm of offline alignment methods, called Trust Region (including variants TR-DPO, TR-IPO, TR-KTO), which dynamically updates the reference policy throughout the training process. Our results show that TR alignment methods effectively mitigate overoptimization, enabling models to maintain strong performance even when substantially deviating from the initial reference policy. We demonstrate the efficacy of these approaches not only through toy examples that exhibit reduced overoptimization, but also through direct, side-by-side comparisons in specific tasks such as helpful and harmless dialogue, as well as summarization, where they surpass conventional methods. Additionally, we report significant improvements in general-purpose assistant setups with the Llama3 model on the AlpacaEval 2 and Arena-Hard benchmarks, highlighting the advantages of Trust Region methods over classical approaches.

Paper Structure (45 sections, 15 equations, 19 figures, 13 tables)

Introduction
Related Work
Trust Region Alignment
Motivation
Method
Experiments
Experimental Setup
Tasks
Models
Update Strategies
Evaluation
Performance Comparison on the Two Tasks
General Benchmarks Evaluation
Divergence and Overoptimization Analysis
Discussion
...and 30 more sections

Figures (19)

Figure 1: Evaluation performance of models trained by different methods, measured on the (a) AlpacaEval 2 and (b) Arena-Hard benchmarks. The Llama-3-Base model was used as the baseline. The SFT stage was conducted on the UltraChat dataset, and the alignment stage on UltraFeedback. We compare vanilla methods (DPO, IPO, KTO) (left bars), their versions with a soft reference policy update (center bars), and with a hard update (right bars). Standard deviations are shown in panel (a), and 95% confidence intervals in panel (b). See Section \ref{benchs} for more details.
Figure 2: Results for (a) DPO, (b) $\text{TR-DPO}^\alpha$ with soft update ($\alpha = 0.6$), and (c) $\text{TR-DPO}^\tau$ with hard update ($\tau = 8$) on the toy MDP problem (arXiv:2406.02900). The top rows represent the probabilities of OOD sequences, while the bottom rows show the probabilities of chosen and rejected sequences. For vanilla DPO, a portion of the probability mass spans over OOD examples. In contrast, the probability mass decreases for OOD sequences in both TR-DPO methods, indicating reduced overoptimization. We evaluated $100$ runs with different seeds and plotted the mean and standard deviation values. See Section \ref{method} for more details.
Figure 3: Schematic illustration of the proposed method. While vanilla DPO (left) uses a fixed reference policy during training, TR-DPO updates it either with a soft update (center), in which the parameters of $\pi_{\theta}$ are merged into the parameters of $\pi_{\text{ref}}$ with weight $\alpha$, or with a hard update (right), in which the parameters of $\pi_{\theta}$ are copied into the reference policy once every predetermined number of training steps $\tau$. See Section \ref{method} for more details.
Figure 4: AutoSxS comparisons of the Pythia 2.8B model trained with TR-DPO$^{\alpha}$ (Eq. \ref{soft-update}) and TR-DPO$^{\tau}$ (Eq. \ref{hard-update}) against the DPO baseline for (a) the Anthropic-HH and (b) the Reddit TL;DR datasets. Evaluations of TR-DPO$^{\alpha}$ span $\alpha$ values in $[0.1, 0.8]$, with improvements most pronounced for $\alpha = 0.5$ to $\alpha = 0.6$. For TR-DPO$^{\tau}$, $\tau$ is assessed at values $2^n$ for $n = 5, \ldots, 10$, with $\tau = 512$ showing statistically significant improvements for both datasets. See Section \ref{sec:task_comp} for more details.
Figure 5: The relationship between KL divergence and HC mean value for (a) DPO/TR-DPO, (b) IPO/TR-IPO, and (c) KTO/TR-KTO ($\alpha = 0.6$, $\tau = 512$) across different $\beta$ values. While for low KL values both vanilla and TR methods show similar HC values, as the KL increases, the vanilla methods start to suffer from overoptimization. In contrast, TR methods show better quality at large KL values, supporting the hypothesis from Section \ref{motivation}. See Section \ref{sec:red_over} for details.
...and 14 more figures
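
The two reference-policy update rules described in the Figure 3 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: parameters are modeled as plain dictionaries of float lists, the function names `soft_update` and `hard_update` are hypothetical, and the mixing direction (weight $\alpha$ on the trained policy, $1 - \alpha$ on the previous reference) is an assumption inferred from the caption's wording.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def soft_update(ref: Params, policy: Params, alpha: float) -> Params:
    """Merge policy weights into the reference with weight alpha.

    Assumed rule: ref <- alpha * policy + (1 - alpha) * ref.
    """
    return {
        name: [alpha * p + (1.0 - alpha) * r
               for p, r in zip(policy[name], ref[name])]
        for name in ref
    }


def hard_update(ref: Params, policy: Params, step: int, tau: int) -> Params:
    """Copy the policy into the reference once every tau training steps."""
    if step > 0 and step % tau == 0:
        # Copy the lists so later policy changes do not alter the reference.
        return {name: list(values) for name, values in policy.items()}
    return ref


if __name__ == "__main__":
    ref = {"w": [0.0, 2.0]}
    policy = {"w": [1.0, 4.0]}
    print(soft_update(ref, policy, alpha=0.5))
    print(hard_update(ref, policy, step=8, tau=8))
```

With `alpha = 0` or a very large `tau`, both rules leave the reference fixed, which corresponds to the vanilla fixed-reference setting shown on the left of Figure 3.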
