A Probability--Quality Trade-off in Aligned Language Models and its Relation to Sampling Adaptors

Naaman Tan, Josef Valvoda, Tianyu Liu, Anej Svete, Yanxia Qin, Kan Min-Yen, Ryan Cotterell

TL;DR

It is shown that, when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood under the prior language model, i.e., the same model before alignment with human preferences.

Abstract

The relationship between the quality of a string, as judged by a human reader, and its probability, $p(\boldsymbol{y})$, under a language model undergirds the development of better language models. For example, many popular algorithms for sampling from a language model have been conceived with the goal of manipulating $p(\boldsymbol{y})$ to place higher probability on strings that humans deem of high quality. In this article, we examine the probability--quality relationship in language models explicitly aligned to human preferences, e.g., through reinforcement learning from human feedback. We show that, when sampling corpora from an aligned language model, there exists a trade-off between the strings' average reward and average log-likelihood under the prior language model, i.e., the same model before alignment with human preferences. We provide a formal treatment of this phenomenon and demonstrate how the choice of sampling adaptor determines how much likelihood is exchanged for reward.

Paper Structure

This paper contains 38 sections, 10 theorems, 46 equations, 4 figures, and 1 algorithm.

Key Result

Proposition 1

where $\delta = {\mathcal{O}}(\frac{1}{N})$ and $C \mathrel{{\stackrel{\textnormal{\tiny def}}{=}}} {\mathrm{H}}({\boldsymbol{Y}} \mid {A} = {\textsc{+}}) - \log {{Z}({\textsc{+}})}$ is a constant, and we use the shorthands $\log {p}({\mathcal{Y}}) = \sum_{n=1}^N \log {p}({{\boldsymbol{y}}^{(n)}})$
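
To make the role of the constant $C$ and the corpus size $N$ concrete, the following is a minimal toy sketch, not the paper's construction. It assumes the aligned model has the common KL-regularised form $p_{+}(y) \propto p(y)\exp(r(y)/\beta)$ over a small finite set of "strings", with normaliser $Z$ and an assumed scaling parameter `beta`. Under that assumption, $\log p(y) + r(y)/\beta = \log p_{+}(y) + \log Z$ holds for every string, so the quantity $\frac{1}{N}\log p(\mathcal{Y}) + \frac{1}{\beta N} r(\mathcal{Y}) + C$ equals the gap between the corpus's average log-probability under $p_{+}$ and the negative entropy of $p_{+}$, which shrinks as $N$ grows. All function names and parameters here are hypothetical.

```python
import math
import random


def make_toy_models(num_strings=50, beta=1.0, seed=0):
    """Build a toy prior p, reward r, and aligned model p_plus over a finite set."""
    rng = random.Random(seed)
    logits = [rng.gauss(0.0, 1.0) for _ in range(num_strings)]
    log_norm = math.log(sum(math.exp(l) for l in logits))
    log_p = [l - log_norm for l in logits]  # prior log-probabilities
    reward = [rng.uniform(-1.0, 1.0) for _ in range(num_strings)]
    # ASSUMPTION: p_plus(y) = p(y) * exp(r(y) / beta) / Z
    log_z = math.log(sum(math.exp(lp + r / beta) for lp, r in zip(log_p, reward)))
    log_p_plus = [lp + r / beta - log_z for lp, r in zip(log_p, reward)]
    entropy = -sum(math.exp(lq) * lq for lq in log_p_plus)
    c = entropy - log_z  # toy analogue of C = H(Y | A = +) - log Z(+)
    return log_p, reward, log_p_plus, c


def corpus_deviation(corpus_size, beta=1.0, seed=1):
    """Sample a corpus from p_plus; return (avg log p, avg reward, deviation)."""
    log_p, reward, log_p_plus, c = make_toy_models(beta=beta)
    rng = random.Random(seed)
    weights = [math.exp(lq) for lq in log_p_plus]
    corpus = rng.choices(range(len(weights)), weights=weights, k=corpus_size)
    avg_log_p = sum(log_p[y] for y in corpus) / corpus_size
    avg_reward = sum(reward[y] for y in corpus) / corpus_size
    # The deviation is zero exactly when the corpus's sample entropy equals H(p_plus).
    deviation = avg_log_p + avg_reward / beta + c
    return avg_log_p, avg_reward, deviation


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        avg_log_p, avg_reward, dev = corpus_deviation(n)
        print(f"N={n:6d}  avg log p={avg_log_p:+.3f}  "
              f"avg r={avg_reward:+.3f}  deviation={dev:+.4f}")
```

When the deviation is close to zero, average log-probability under the prior and average reward are tied by a line of slope $-1/\beta$: raising one lowers the other.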

Figures (4)

  • Figure 1: Illustration of the probability--quality trade-off with toy data, where quality is measured by the reward function. (Left) "String"-level correlations between probability and reward, where strings are mimicked by arbitrary objects. (Right) Corpus-level correlations between average log-probability and average reward. We include a best-fit line for corpora in the typical set, i.e., those with sample entropy close to ${\mathrm{H}}({{p}_{{\textsc{+}}}})$. In both panels, each string or corpus is coloured by its log-probability, from low (light) to high (dark).
  • Figure 2: The probability--quality relationship, where quality is measured by the reward function. (Left) String-level correlations between log-probability and quality. (Right) Corpus-level correlations between average log-probability and average quality, with corpora created by different sampling adaptors. Higher colour intensity denotes a higher temperature used with the sampling adaptor.
  • Figure 3: The probability--quality relationship in DPO-tuned models, where quality is measured by the secret reward function. (Left) String-level correlations between log-probability and quality. (Right) Corpus-level correlations between average log-probability and average quality, with corpora created by different sampling adaptors. Higher colour intensity denotes a higher temperature used with the sampling adaptor.
  • Figure 4: Toy models of ${{p}_{{\textsc{+}}}}({x})$, ${p}({x})$ and ${{r}}({x})$ analogous to the distributions over strings.
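
The effect of a sampling adaptor in the corpus-level panels can be sketched on a toy model in the spirit of Figure 4. This is an illustrative sketch under stated assumptions rather than the paper's experimental setup: the aligned model is assumed to have the form $p_{+}(x) \propto p(x)\exp(r(x)/\beta)$, and the adaptor is a temperature applied to whole objects, whereas real sampling adaptors act on next-token distributions. All names (`toy_aligned_model`, `temperature_adaptor`, `expected_stats`) are hypothetical. The code computes exact expectations under the adapted distribution $q_T \propto p_{+}^{1/T}$ for several temperatures.

```python
import math
import random


def toy_aligned_model(num_objects=50, beta=1.0, seed=0):
    """Toy prior p, reward r, and aligned model p_plus over a finite set."""
    rng = random.Random(seed)
    logits = [rng.gauss(0.0, 1.0) for _ in range(num_objects)]
    log_norm = math.log(sum(math.exp(l) for l in logits))
    log_p = [l - log_norm for l in logits]
    reward = [rng.uniform(-1.0, 1.0) for _ in range(num_objects)]
    # ASSUMPTION: p_plus(x) = p(x) * exp(r(x) / beta) / Z
    log_z = math.log(sum(math.exp(lp + r / beta) for lp, r in zip(log_p, reward)))
    log_p_plus = [lp + r / beta - log_z for lp, r in zip(log_p, reward)]
    return log_p, reward, log_p_plus


def temperature_adaptor(log_probs, temperature):
    """Return q with q(x) proportional to exp(log_probs[x] / temperature)."""
    scaled = [lq / temperature for lq in log_probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    unnorm = [math.exp(s - peak) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def expected_stats(temperature, beta=1.0):
    """Exact expected prior log-probability and reward under the adapted model."""
    log_p, reward, log_p_plus = toy_aligned_model(beta=beta)
    q = temperature_adaptor(log_p_plus, temperature)
    exp_log_p = sum(qi * lp for qi, lp in zip(q, log_p))
    exp_reward = sum(qi * r for qi, r in zip(q, reward))
    return exp_log_p, exp_reward


if __name__ == "__main__":
    for temp in (0.5, 0.75, 1.0, 1.5, 2.0):
        exp_log_p, exp_reward = expected_stats(temp)
        print(f"T={temp:4.2f}  E[log p]={exp_log_p:+.3f}  "
              f"E[r]={exp_reward:+.3f}  sum={exp_log_p + exp_reward:+.3f}")
```

In this toy setting, $\mathbb{E}_{q_T}[\log p] + \mathbb{E}_{q_T}[r]/\beta$ equals $\mathbb{E}_{q_T}[\log p_{+}] + \log Z$, which falls as the temperature rises. Each temperature therefore fixes a different total budget, and how that budget splits between prior log-probability and reward depends on the model.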

Theorems & Definitions (32)

  • Proposition 1: Probability--quality trade-off
  • Example 1
  • proof
  • proof
  • Example 2: A Tight LM with Infinite Entropy
  • Proposition 2
  • proof
  • Definition 1: Non-trivial Language Model
  • Definition 2: Rényi Entropy
  • Definition 3
  • ...and 22 more