Trading Off Diversity and Quality in Natural Language Generation

Hugh Zhang, Daniel Duckworth, Daphne Ippolito, Arvind Neelakantan

TL;DR

This paper reframes open-ended natural language generation as a multi-objective optimization problem balancing quality and diversity, and conducts the first large-scale, fair evaluation of decoding algorithms along the quality-diversity frontier. It formalizes a framework combining human-judged quality with Shannon entropy to define a joint objective, and experimentally shows nucleus/top-p sampling yields superior quality when diversity is controlled, while random sampling is diverse but low-quality. The study also confirms the existence of the likelihood trap, quantifies the relationship between model likelihood and human judgments, and introduces selective sampling as a tractable approach to globally-normalized temperature sampling, along with theoretical and empirical analyses of its limitations. The findings offer practical guidance for choosing decoding strategies in open-ended tasks and lay groundwork for further improvements in adaptive decoding and cross-modal generation.

Abstract

For open-ended language generation tasks such as storytelling and dialogue, choosing the right decoding algorithm is critical to controlling the tradeoff between generation quality and diversity. However, there presently exists no consensus on which decoding procedure is best or even the criteria by which to compare them. We address these issues by casting decoding as a multi-objective optimization problem aiming to simultaneously maximize both response quality and diversity. Our framework enables us to perform the first large-scale evaluation of decoding methods along the entire quality-diversity spectrum. We find that when diversity is a priority, all methods perform similarly, but when quality is viewed as more important, the recently proposed nucleus sampling (Holtzman et al. 2019) outperforms all other evaluated decoding algorithms. Our experiments also confirm the existence of the "likelihood trap", the counter-intuitive observation that high likelihood sequences are often surprisingly low quality. We leverage our findings to create and evaluate an algorithm called selective sampling which tractably approximates globally-normalized temperature sampling.
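To make the nucleus (top-p) sampling mentioned in the abstract concrete, here is a minimal sketch of the truncation step as it is commonly described: keep the smallest set of highest-probability tokens whose cumulative mass reaches a threshold, then renormalize. The function names `nucleus_filter` and `sample_token`, the parameter name `top_p`, and the toy distribution are illustrative assumptions, not code from the paper.

```python
import random


def nucleus_filter(probs, top_p):
    """Zero out the low-probability tail and renormalize the rest.

    Keeps the smallest set of highest-probability tokens whose total
    mass is at least top_p (hypothetical helper, not the paper's code).
    """
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize over the kept tokens
    return filtered


def sample_token(probs, rng=random):
    """Draw one token index from a (possibly truncated) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy next-token distribution (made up for illustration).
next_token_probs = [0.5, 0.3, 0.15, 0.05]
truncated = nucleus_filter(next_token_probs, top_p=0.7)
token = sample_token(truncated)
```

Lowering `top_p` removes more of the tail, which reduces diversity; setting it to 1.0 keeps every token and recovers plain sampling from the model.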

Paper Structure

This paper contains 14 sections, 2 theorems, 11 equations, 12 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

Let $P$ be a probability distribution over some finite set $\mathcal{X}$. Let $H$ be the Shannon entropy function. The probability distribution $Q$ which minimizes the reverse KL divergence $D_{KL}(Q \,\|\, P)$ subject to $H(Q) = K$ for any achievable constant $K$ has the form $Q(x) = \frac{P(x)^{1/\tau}}{\sum_{x' \in \mathcal{X}} P(x')^{1/\tau}}$ for some temperature $\tau \in [0, 1]$.
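The behavior described by Proposition 1 can be explored numerically. The sketch below assumes the standard temperature-scaled form, $Q(x) \propto P(x)^{1/\tau}$, and reports the entropy of $Q$ and its reverse KL divergence from $P$ for a few temperatures. The helper names and the toy distribution are assumptions made for illustration, and the code does not prove optimality; it only shows how entropy and divergence move as $\tau$ changes.

```python
import math


def apply_temperature(probs, tau):
    """Return Q with Q(x) proportional to P(x)^(1/tau), for 0 < tau <= 1."""
    if not 0 < tau <= 1:
        raise ValueError("tau must be in (0, 1]")
    # Work in log space for stability; zero-probability items stay at zero.
    logs = [math.log(p) / tau if p > 0 else float("-inf") for p in probs]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def reverse_kl(q, p):
    """D_KL(Q || P); assumes P(x) > 0 wherever Q(x) > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Toy distribution (made up for illustration).
P = [0.5, 0.3, 0.15, 0.05]
for tau in (1.0, 0.7, 0.4):
    Q = apply_temperature(P, tau)
    print(f"tau={tau}: H(Q)={entropy(Q):.4f}, KL(Q||P)={reverse_kl(Q, P):.4f}")
```

At $\tau = 1$ the sketch returns $P$ itself; smaller temperatures concentrate mass on the most likely items, which lowers entropy and increases the divergence from $P$.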

Figures (12)

  • Figure 1: The Likelihood Trap. We asked 146 crowdworkers to rate the quality of 100 sentences across a variety of model likelihoods. While model log likelihoods are generally positively correlated with average human quality judgments, we notice an inflection point after which they become negatively correlated. Each point in the graph represents the average crowdworker rating of 5 sentences with similar model likelihoods. We discuss this phenomenon in more depth in the section on the likelihood trap.
  • Figure 2: Any choice of temperature for local temperature sampling must have $P(A) = P(B)$. However, choosing global temperature $\tau = 0.5$ results in $P(A) = 0.5763$ and $P(B) = 0.4237$ which is impossible for any choice of local temperatures to satisfy.
  • Figure 3: Histogram over $p_{\mathrm{model}}(x)$ for samples drawn from the same prompt. 99.5% of samples have log likelihood less than the chosen cutoff $\alpha$ shown in black.
  • Figure 4: Human judgment scores as a function of decoding algorithm's entropy. Each point represents a single choice of decoding algorithm and hyperparameter. Error bars represent 95% bootstrap confidence intervals.
  • Figure 5: Human judgment scores for each decoding algorithm and hyperparameter choice. "Selective" is selective sampling and "model" is sampling directly from the probability distribution output by the language model. A score of 0 represents no preference. Selective sampling underperforms other, more computationally efficient strategies.
  • ...and 7 more figures
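
The contrast drawn in the Figure 2 caption between local and global temperature can be illustrated with a toy two-step model. The tree below is hypothetical and is not the one shown in the figure, so its numbers differ from the caption's. First tokens A and B are equally likely; A has one continuation and B has two equally likely continuations. Local temperature rescales each step's distribution separately, while global temperature rescales whole-sequence probabilities.

```python
def local_temperature(step_probs, tau):
    """Rescale a single step's distribution: q_i proportional to p_i^(1/tau)."""
    weights = [p ** (1.0 / tau) for p in step_probs]
    total = sum(weights)
    return [w / total for w in weights]


def global_temperature(seq_probs, tau):
    """Rescale whole-sequence probabilities: Q(s) proportional to P(s)^(1/tau)."""
    weights = {seq: p ** (1.0 / tau) for seq, p in seq_probs.items()}
    total = sum(weights.values())
    return {seq: w / total for seq, w in weights.items()}


# Hypothetical tree: P(A) = P(B) = 0.5 at the first step.
# A is followed by "x" with probability 1; B by "y" or "z" with 0.5 each.
seq_probs = {("A", "x"): 0.5, ("B", "y"): 0.25, ("B", "z"): 0.25}

# Local temperature at the first step: a uniform step stays uniform for any tau.
local_first = local_temperature([0.5, 0.5], tau=0.5)

# Global temperature shifts mass toward the single most likely sequence.
q = global_temperature(seq_probs, tau=0.5)
p_a_global = sum(v for seq, v in q.items() if seq[0] == "A")
p_b_global = sum(v for seq, v in q.items() if seq[0] == "B")
```

In this toy tree, global temperature with $\tau = 0.5$ gives the A branch two thirds of the mass, whereas no local temperature can move the first-step probabilities away from one half each.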

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof