Truncation Sampling as Language Model Desmoothing

John Hewitt; Christopher D. Manning; Percy Liang

TL;DR

This work reframes truncation sampling as desmoothing: treating a language model's distribution as a smoothed version of the true distribution and estimating a subset of the true distribution's support. It introduces η-sampling, an entropy-aware truncation method that avoids cutting high-probability words while still removing low-probability ones. Through MAUVE-driven hyperparameter searches, human assessments, and targeted analyses, η-sampling yields more plausible long-form text and is better at breaking out of repetition than existing methods like top-p, while maintaining competitive diversity. The paper also states principles for truncation as desmoothing and tests truncation algorithms against a battery of test distributions.

Abstract

Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms, like top-$p$ or top-$k$, address this by setting some words' probabilities to zero at each step. This work provides framing for the aim of truncation, and an improved algorithm for that aim. We propose thinking of a neural language model as a mixture of a true distribution and a smoothing distribution that avoids infinite perplexity. In this light, truncation algorithms aim to perform desmoothing, estimating a subset of the support of the true distribution. Finding a good subset is crucial: we show that top-$p$ unnecessarily truncates high-probability words, for example causing it to truncate all words but "Trump" for a document that starts with "Donald". We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold. Compared to previous algorithms, $\eta$-sampling generates more plausible long English documents according to humans, is better at breaking out of repetition, and behaves more reasonably on a battery of test distributions.
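To make the contrast between top-$p$ and an entropy-dependent threshold concrete, here is a minimal Python sketch. It rests on several assumptions: the abstract does not state the threshold formula, so the form $\eta = \min(\epsilon, \sqrt{\epsilon}\, e^{-h})$ (with $h$ the entropy of the next-word distribution) is assumed here; the function names, the value `epsilon=0.01`, and the toy probabilities after "Donald" are all invented for illustration and are not taken from the paper's experiments.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a dict mapping word -> probability."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def renormalize(kept):
    """Rescale the surviving probabilities so they sum to one."""
    total = sum(kept.values())
    return {word: p / total for word, p in kept.items()}


def top_p_truncate(probs, p=0.95):
    """Keep the smallest set of most-probable words whose mass reaches p."""
    kept, cumulative = {}, 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def eta_truncate(probs, epsilon=0.01):
    """Drop words whose probability is not above an entropy-dependent threshold.

    ASSUMPTION: threshold form min(epsilon, sqrt(epsilon) * exp(-entropy)).
    """
    eta = min(epsilon, math.sqrt(epsilon) * math.exp(-entropy(probs)))
    kept = {word: p for word, p in probs.items() if p > eta}
    if not kept:
        # Fall back to the single most probable word so sampling stays possible.
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}
    return renormalize(kept)


# Hypothetical next-word distribution for a document that starts with "Donald".
after_donald = {"Trump": 0.96, "Duck": 0.02, "Knuth": 0.012, "zzz": 0.008}

print(sorted(top_p_truncate(after_donald, p=0.95)))
print(sorted(eta_truncate(after_donald, epsilon=0.01)))
```

On this toy distribution, top-$p$ with $p = 0.95$ keeps only "Trump" because that single word already covers the requested mass, while the threshold rule keeps "Trump", "Duck", and "Knuth" and drops only the word below the threshold. Under the assumed formula, the threshold shrinks as entropy grows, so a near-uniform distribution over many words would not be emptied out by a fixed cutoff.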

Paper Structure (53 sections, 17 equations, 6 figures, 5 tables)

This paper contains 53 sections, 17 equations, 6 figures, 5 tables.

Introduction
Background
Language Models
Truncation sampling
Truncation as Desmoothing
KL-divergence and mode covering
A neural LM as a smoothed distribution
A local measure of truncation quality
Principles for truncation as desmoothing
Absolute probability.
Relative probability.
Desmoothing and $n$-gram models
Methods
Top-$p$ (nucleus) sampling
Typical decoding
...and 38 more sections

Figures (6)

Figure 1: A neural LM as a mixture of the true distribution, and a uniform-like smoothing distribution. Truncation aims to approximate the true distribution support.
Figure 2: Portions of unconditional samples from an unsmoothed and uniform-smoothed $5$-gram model; divergence due to leaving the support of the high-order distribution is in red.
Figure 3: Top-$p$ sampling aggressively truncates low-entropy distributions and $\epsilon$-sampling aggressively truncates high-entropy distributions, while $\eta$-sampling strikes a balance.
Figure 4: Unit tests of the truncation behavior of top-$p$, $\epsilon$, and $\eta$-sampling on CheckList-inspired prefixes.
Figure 5: The interface shown to human annotators for Study 1.
...and 1 more figures
