A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors

Naaman Tan; Josef Valvoda; Tianyu Liu; Anej Svete; Yanxia Qin; Kan Min-Yen; Ryan Cotterell

TL;DR

It is shown that, when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood under the prior language model, i.e., the same model before alignment with human preferences.

Abstract

The relationship between the quality of a string, as judged by a human reader, and its probability $p(\boldsymbol{y})$ under a language model undergirds the development of better language models. For example, many popular algorithms for sampling from a language model have been conceived with the goal of manipulating $p(\boldsymbol{y})$ to place higher probability on strings that humans deem of high quality. In this article, we examine the probability--quality relationship in language models explicitly aligned to human preferences, e.g., through reinforcement learning from human feedback. We show that, when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood under the prior language model, i.e., the same model before alignment with human preferences. We provide a formal treatment of this phenomenon and demonstrate how the choice of sampling adaptor lets us select how much likelihood we exchange for reward.
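
The trade-off described in the abstract can be illustrated with a small toy sketch. Everything here is an assumption made for illustration and is not taken from the paper's experiments: the "strings" are `K` arbitrary items, the prior `p` and the reward `r` are random, the aligned model is assumed to take the standard KL-regularised form $p_{+}(\boldsymbol{y}) \propto p(\boldsymbol{y}) \exp(r(\boldsymbol{y}) / \beta)$, and the temperature adaptor is applied to the whole-string distribution rather than token by token. Under that assumed form, taking logs gives $\log p(\boldsymbol{y}) + r(\boldsymbol{y})/\beta = \log p_{+}(\boldsymbol{y}) + \log Z$, so the corpus averages of log-prior and reward are tied together for corpora sampled from $p_{+}$.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50        # number of toy "strings" (hypothetical)
beta = 1.0    # hypothetical regularisation strength

# Made-up prior p over K toy strings and a made-up reward r.
p = rng.dirichlet(np.ones(K))
r = rng.normal(size=K)

# ASSUMPTION: aligned model p_plus(y) = p(y) * exp(r(y) / beta) / Z.
unnorm = p * np.exp(r / beta)
Z = unnorm.sum()
p_plus = unnorm / Z


def temperature_adaptor(dist, T):
    """Return the distribution proportional to dist ** (1 / T)."""
    logits = np.log(dist) / T
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def corpus_stats(dist, N, rng):
    """Sample N strings from dist; return (avg log-prior, avg reward)."""
    idx = rng.choice(K, size=N, p=dist)
    return np.log(p[idx]).mean(), r[idx].mean()


# Constant of the toy model: entropy of p_plus minus log Z.
entropy = -(p_plus * np.log(p_plus)).sum()
C = entropy - np.log(Z)

for T in (0.5, 1.0, 2.0):
    avg_logp, avg_r = corpus_stats(temperature_adaptor(p_plus, T), 20000, rng)
    # At T = 1 the printed sum should be close to zero; other
    # temperatures move the corpus away from that point.
    print(f"T={T}: avg log p={avg_logp:.3f}, avg r={avg_r:.3f}, "
          f"sum+C={avg_logp + avg_r / beta + C:.3f}")
```

In this sketch, a corpus sampled at `T = 1.0` satisfies `avg_logp + avg_r / beta + C ≈ 0`, so a higher average reward must be paid for with a lower average log-prior. How each average shifts at other temperatures depends on the toy `p` and `r`.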

Paper Structure (38 sections, 10 theorems, 46 equations, 4 figures, 1 algorithm)

Introduction
The Probability--Quality Relationship
Learning from Human Feedback
Sampling Adaptors
Theoretical Results
A Fundamental Trade-off
Assumptions.
A Tighter Bound.
Controlling the Trade-off
The Emergence of Simpson's Paradox
Experimental Setup
A Toy Experiment
Modeling ${{p}_{{\textsc{+}}}}$, ${p}$ and ${{r}}$.
Constructing Corpora with Causal Bootstrapping.
The Trade-off in Practice
...and 23 more sections

Key Result

Proposition 1

where $\delta = \mathcal{O}(\frac{1}{N})$ and $C \mathrel{\stackrel{\textnormal{\tiny def}}{=}} \mathrm{H}(\boldsymbol{Y} \mid A = \textsc{+}) - \log Z(\textsc{+})$ is a constant, and we use the shorthands $\log p(\mathcal{Y}) = \sum_{n=1}^N \log p(\boldsymbol{y}^{(n)})$

Figures (4)

Figure 1: Illustration of the probability--quality trade-off with toy data, where quality is measured by the reward function. (Left) "String"-level correlations between probability and reward, where strings are mimicked by arbitrary objects. (Right) Corpus-level correlations between average log-probability and average reward. We include a best-fit line for corpora in the typical set, i.e., those with sample entropy close to ${\mathrm{H}}({{p}_{{\textsc{+}}}})$. In both figures, the log-probability of each string or corpus is coloured according to high (dark) and low (light).
Figure 2: The probability--quality relationship, where quality is measured by the reward function. (Left) String-level correlations between log-probability and quality. (Right) Corpus-level correlations between average log-probability and average quality, with corpora created by different sampling adaptors. Higher colour intensity denotes a higher temperature used with the sampling adaptor.
Figure 3: The probability--quality relationship in DPO-tuned models, where quality is measured by the secret reward function. (Left) String-level correlations between log-probability and quality. (Right) Corpus-level correlations between average log-probability and average quality, with corpora created by different sampling adaptors. Higher colour intensity denotes a higher temperature used with the sampling adaptor.
Figure 4: Toy models of ${{p}_{{\textsc{+}}}}({x})$, ${p}({x})$ and ${{r}}({x})$ analogous to the distributions over strings.

Theorems & Definitions (32)

Proposition 1: Probability--quality trade-off
Example 1
proof
proof
Example 2: A Tight LM with Infinite Entropy
Proposition 2
proof
Definition 1: Non-trivial Language Model
Definition 2: Rényi Entropy
Definition 3
...and 22 more
