Aligning language models with human preferences

Tomasz Korbak

TL;DR

This thesis reframes the alignment of language models (LMs) as conditioning on human preferences: a form of Bayesian inference in which a base LM is the prior and is updated on evidence about human values and preferences. It shows that KL-regularised reinforcement learning implements this conditioning and clarifies its relation to distribution matching, including Distributional Policy Gradients and their conditional extensions, which support tasks such as translation, summarisation, and code generation. A key finding is that pretraining LMs with human feedback (PHF) often yields stronger alignment and robustness than applying feedback only during finetuning, with conditional training emerging as a particularly effective approach. Together, the work advocates a complementary, layered view of alignment that combines RL-based methods, distribution-matching techniques, and pretraining with preferences to yield safer, more reliable AI assistants across modalities and tasks.

Abstract

Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but that distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from and complementary to RLHF.
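The Bayesian view of KL-regularised RL can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the thesis: it assumes the standard form of the objective, $\mathbb{E}_{\pi}[r(x)] - \beta\, D_{\mathrm{KL}}(\pi, a)$, with a prior $a$ over three outcomes, and the names `prior`, `reward` and `beta` are hypothetical. Under that assumption, the maximiser is the posterior-like distribution proportional to $a(x)\exp(r(x)/\beta)$, and the optimal value equals $\beta \log Z$.

```python
import math
import random


def normalise(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL divergence D_KL(p, q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularised_objective(policy, prior, reward, beta):
    """Expected reward minus a KL penalty for drifting away from the prior."""
    expected_reward = sum(p * r for p, r in zip(policy, reward))
    return expected_reward - beta * kl(policy, prior)


def bayesian_posterior(prior, reward, beta):
    """Prior reweighted by exp(reward / beta), then normalised."""
    return normalise([a * math.exp(r / beta) for a, r in zip(prior, reward)])


# Toy setup (assumed values): three possible outputs of a "language model".
prior = [0.5, 0.3, 0.2]
reward = [0.0, 1.0, -1.0]
beta = 0.5

posterior = bayesian_posterior(prior, reward, beta)
best = kl_regularised_objective(posterior, prior, reward, beta)

# Log of the normalising constant Z of the reweighted prior.
log_z = math.log(sum(a * math.exp(r / beta) for a, r in zip(prior, reward)))

# Randomly sampled policies should never beat the posterior.
rng = random.Random(0)
random_scores = []
for _ in range(1000):
    candidate = normalise([rng.random() + 1e-9 for _ in prior])
    random_scores.append(
        kl_regularised_objective(candidate, prior, reward, beta)
    )
max_random = max(random_scores)
```

Comparing `max_random` with `best` gives a quick sanity check that no sampled policy scores higher than the reweighted prior. As `beta` shrinks towards zero the posterior concentrates on the highest-reward outcome, which mirrors the distribution collapse that the outline associates with standard RL without a KL penalty.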

Paper Structure (181 sections, 1 theorem, 68 equations, 30 figures, 12 tables, 3 algorithms)


Introduction
Background
Language models
Self-supervised learning
Scaling and emergence of new capabilities
AI assistants
Risks posed by misaligned language models
Approaches to aligning language models
Prompt engineering
Supervised finetuning
Reinforcement learning from human feedback
Summary
RL with KL penalties is better viewed as Bayesian inference
Introduction
Finetuning language models using standard RL and distribution collapse
...and 166 more sections

Key Result

Theorem 1

Consider the following EBM: and let $p_z$ be the normalised distribution $p_z(x) = \frac{1}{Z}\; P_z(x)$, with $Z=\sum_x P_z(x)$. Then:

Figures (30)

Figure 1: In the chapter, we argue that aligning language models (LMs) with human preferences is a Bayesian inference problem and RL with KL penalties corresponds to solving it via variational inference.
Figure 2: Values of reward, advantage and the baseline for the first 1000 epochs of a pointwise constraint experiment.
Figure 3: Evaluation metrics: $D_{\mathrm{KL}}(p, \pi_{\theta})$ ($\downarrow$ better), $\mathbb{E}_{\pi_\theta} \phi(x)$ ($\uparrow$ better), $D_{\mathrm{KL}}(\pi_{\theta}, a)$ ($\downarrow$ better), Distinct-1 ($\uparrow$ better) and Self-BLEU-5 ($\downarrow$ better) aggregated over 6 pointwise constraints experiments (tasks 1-6) for policies obtained from GDC++, GDC, Ziegler and Reinforce. See Figure \ref{fig:distributional-compare-methods-metrics} for aggregated distributional constraints experiments. In the Appendix, Figures \ref{fig:pointwise-compare-methods-split1}-\ref{fig:distributional-compare-methods-split} contain individual views and final results of each run.
Figure 4: $\mathbb{E}_{{\pi_\theta}} \phi(x)$ or $\hat{\mu}$ per constraint ($\uparrow$ better) and $D_{\mathrm{KL}}(p, \pi_{\theta})$ ($\downarrow$ better) as a function of the number of samples reported for task 1 (a) and task 8 (b). We report the number of samples (i.e. the number of epochs times the batch size) for a fair comparison of convergence speed. GDC++ is consistently superior across all batch sizes in terms of convergence and constraint satisfaction. The effect is more conspicuous with small batch sizes. Batch sizes 512 and 2014 are greyed out for clarity.
Figure 5: Comparison between GDC and GDC++ using a set of variance diagnosis metrics on pointwise and distributional constraints experiments.
...and 25 more figures

Theorems & Definitions (3)

Theorem 1
proof
proof
