On the Efficacy of Sampling Adapters

Clara Meister, Tiago Pimentel, Luca Malagutti, Ethan G. Wilcox, Ryan Cotterell

TL;DR

This paper formalizes sampling adapters: simple, plug-in modifications to the per-step conditional distributions of autoregressive language models that unify common decoding methods, such as nucleus and top-k sampling, under a single framework. It argues that adapters enact a precision–recall trade-off: they reduce the model's ability to generate certain strings (lower recall) but make it more likely that generated text is high quality (higher precision). Using reverse cross-entropy, reverse KL divergence, and balanced measures such as total variation distance (TVD) and Jensen–Shannon (JS) divergence, the authors show that precision-emphasizing measures correlate with sequence-level quality as measured by Mauve, which offers practical guidance for selecting adapter hyperparameters. The work highlights that standard training objectives may be misaligned with generation goals, and that precision-focused measures can serve as efficient proxies for choosing decoding settings in open-ended generation.

Abstract

Sampling is a common strategy for generating text from probabilistic models, yet standard ancestral sampling often results in text that is incoherent or ungrammatical. To alleviate this issue, various modifications to a model's sampling distribution, such as nucleus or top-k sampling, have been introduced and are now ubiquitously used in language generation systems. We propose a unified framework for understanding these techniques, which we term sampling adapters. Sampling adapters often lead to qualitatively better text, which raises the question: From a formal perspective, how are they changing the (sub)word-level distributions of language generation models? And why do these local changes lead to higher-quality text? We argue that the shift they enforce can be viewed as a trade-off between precision and recall: while the model loses its ability to produce certain strings, its precision rate on desirable text increases. While this trade-off is not reflected in standard metrics of distribution quality (such as perplexity), we find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution. Further, these measures correlate with higher sequence-level quality scores, specifically, Mauve.
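
As an illustration of the framing in the abstract, the sketch below treats top-k and nucleus sampling as functions that take one per-step distribution over the vocabulary and return a truncated, renormalized one. It is a minimal sketch under assumptions, not the paper's implementation: the function names, the NumPy representation, and the toy four-token vocabulary are all hypothetical.

```python
import numpy as np

def top_k_adapter(p, k):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    p = np.asarray(p, dtype=float)
    keep = np.argsort(p)[::-1][:k]          # indices of the k largest entries
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def nucleus_adapter(p, top_p):
    """Keep the smallest set of most probable tokens whose mass reaches top_p."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)[::-1]             # tokens sorted by descending probability
    cum = np.cumsum(p[order])
    # first position where cumulative mass >= top_p, converted to a token count
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    keep = order[:cutoff]
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def sample_token(p, adapter, rng, **kwargs):
    """Draw one token index from the adapted distribution."""
    q = adapter(p, **kwargs)
    return int(rng.choice(len(q), p=q))

# Toy per-step distribution over a hypothetical 4-token vocabulary.
step_probs = np.array([0.5, 0.3, 0.15, 0.05])
q_topk = top_k_adapter(step_probs, k=2)
q_nucleus = nucleus_adapter(step_probs, top_p=0.9)
token = sample_token(step_probs, top_k_adapter, np.random.default_rng(0), k=2)
```

In both cases the adapted distribution assigns zero probability to the dropped tokens, which is the loss of recall the abstract refers to. Applying the adapter at every generation step yields the modified sequence-level distribution.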

Paper Structure (25 sections, 12 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Forward/reverse cross-entropy and TVD of the model with GPT-J and the empirical distribution (WebText test set) after different sampling adapter methods have been applied to the output distribution. Note that as described in \ref{sec:pr}, the $\varepsilon$-variant is used in all cross-entropy estimates except for reverse estimates with GPT-J. Dashed lines represent divergence with the unmodified distribution, i.e., the equivalent of using ancestral sampling.
  • Figure 2: Reverse cross-entropy versus forward cross-entropy (the latter uses $\varepsilon$-smoothing) of the model with GPT-J for various sampling adapter and hyperparameter settings. Stars correspond to values at which hyperparameter settings achieved the highest Mauve scores. The black dot corresponds to ancestral sampling.
  • Figure 3: Reverse and forward $\mathrm{KL}$ divergence of the model with GPT-J and the empirical distribution (WebText test set) after different sampling adapter methods have been applied to the output distribution. Note that the $\varepsilon$-method, as described in \ref{sec:pr}, is used in all but reverse $\mathrm{KL}$ estimates of models with GPT-J. Dashed lines represent divergence with the unmodified distribution, i.e., the equivalent of using ancestral sampling.
  • Figure 4: Mauve scores for text generated using WebText prefixes and different sampling adapters. The dashed lines indicate the scores of samples generated using ancestral sampling.
  • Figure 5: JS divergence of the model with the empirical distribution in the first row and with GPT-J in the second row after different sampling adapter methods have been applied to the output distribution. Dashed lines represent the distance to the unmodified distribution. We observe that at lower temperature values, some NaNs are produced by the $\mathrm{JS}$ computation with the empirical distribution.
  • ...and 5 more figures
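
The measures named in the captions above can be sketched for a single per-step distribution as follows. This is a hedged illustration, not the paper's code: it assumes the $\varepsilon$-variant is additive smoothing followed by renormalization (the paper's exact definition may differ), the helper names are hypothetical, and the distributions are made-up toy values rather than results from the paper.

```python
import numpy as np

def eps_smooth(p, eps=1e-6):
    """Assumed epsilon-variant: add eps to every token, then renormalize."""
    p = np.asarray(p, dtype=float)
    return (p + eps) / (1.0 + eps * p.size)

def cross_entropy(p, q):
    """H(p, q) = -sum_y p(y) log q(y); tokens with p(y) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    with np.errstate(divide="ignore"):      # log(0) -> -inf without a warning
        return float(-np.sum(p[mask] * np.log(q[mask])))

def tvd(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def truncate_top_k(p, k):
    """Hypothetical stand-in adapter: keep the k largest entries, renormalize."""
    p = np.asarray(p, dtype=float)
    q = np.zeros_like(p)
    keep = np.argsort(p)[::-1][:k]
    q[keep] = p[keep]
    return q / q.sum()

# Toy values: the model puts too much mass on the last (undesirable) token.
ref = np.array([0.55, 0.30, 0.14, 0.01])    # stand-in for the reference distribution
model = np.array([0.40, 0.25, 0.15, 0.20])  # unmodified model distribution
adapted = truncate_top_k(model, k=2)        # distribution after the adapter

# Forward (recall-emphasizing): reference against model; needs smoothing
# because the adapted distribution has zeros where the reference does not.
fwd_model = cross_entropy(ref, eps_smooth(model))
fwd_adapted = cross_entropy(ref, eps_smooth(adapted))

# Reverse (precision-emphasizing): model against reference; the reference
# here has full support, so no smoothing is applied.
rev_model = cross_entropy(model, ref)
rev_adapted = cross_entropy(adapted, ref)

tvd_model = tvd(ref, model)
tvd_adapted = tvd(ref, adapted)
```

In this toy setting, truncation raises the forward cross-entropy and lowers both the reverse cross-entropy and the TVD, which mirrors the direction of the trade-off described in the abstract. Without smoothing, the forward cross-entropy of the truncated distribution is infinite, which is why some smoothed variant is needed for that estimate.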

Theorems & Definitions (6)

  • Example 3.1
  • Example 3.2
  • Example 3.3
  • Example 3.4
  • Example 3.5
  • Example 3.6