On the Entropy Calibration of Language Models

Steven Cao, Gregory Valiant, Percy Liang

TL;DR

The paper analyzes entropy calibration in autoregressive language models. It shows that miscalibration (entropy per step growing as generation proceeds) improves only slowly with model and data size, especially for text, whose token distribution is heavy-tailed. A simplified power-law analysis links the tail exponent to this slow scaling, and empirical results across models up to 70B parameters corroborate slow improvement for text, while code shows faster gains. Standard mitigations such as temperature adjustment and instruction tuning reduce entropy but typically worsen log loss, implying a tradeoff between diversity and stability. The authors prove that it is theoretically possible to calibrate entropy without increasing log loss, provided a future-entropy predictor can be learned, which offers a pathway to generations that are both stable and diverse. Together, these results motivate developing practical calibration methods based on future-entropy prediction.

Abstract

We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution -- in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
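To make the two quantities being compared concrete, here is a minimal sketch on a toy first-order Markov chain. Everything in it is an illustrative assumption rather than the paper's setup: the 4-token vocabulary, the transition matrices `P` (a stand-in for human text) and `Q` (a stand-in for the model), and the leak probability `EPS`. It computes the model's entropy at each generation step under its own samples, and its log loss at each step on text drawn from `P`.

```python
import math

# Hypothetical 4-token vocabulary: tokens 0 and 1 are "good", 2 and 3 are "junk".
# P stands in for human text; it never emits junk tokens.
P = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],  # never reached under P; present only for shape
    [0.5, 0.5, 0.0, 0.0],
]

EPS = 0.02  # assumed per-step probability that the model leaks into junk tokens
# Q stands in for the model: P plus a small leak into a sticky, high-entropy region.
Q = [
    [0.9 * (1 - EPS), 0.1 * (1 - EPS), EPS / 2, EPS / 2],
    [0.1 * (1 - EPS), 0.9 * (1 - EPS), EPS / 2, EPS / 2],
    [0.05, 0.05, 0.45, 0.45],
    [0.05, 0.05, 0.45, 0.45],
]


def entropy(row):
    """Entropy (in nats) of one next-token distribution."""
    return -sum(q * math.log(q) for q in row if q > 0)


def cross_entropy(p_row, q_row):
    """Expected log loss (in nats) of q_row on tokens drawn from p_row."""
    return -sum(p * math.log(q) for p, q in zip(p_row, q_row) if p > 0)


def step(dist, transition):
    """Advance a distribution over the previous token by one step."""
    n = len(transition)
    return [sum(dist[s] * transition[s][t] for s in range(n)) for t in range(n)]


def per_step_curves(num_steps, start=(0.5, 0.5, 0.0, 0.0)):
    """Return (model entropy per step, log loss per step on text from P)."""
    d_model = list(start)  # previous-token distribution when sampling from Q
    d_human = list(start)  # previous-token distribution for text drawn from P
    entropies, log_losses = [], []
    for _ in range(num_steps):
        entropies.append(sum(w * entropy(Q[s]) for s, w in enumerate(d_model)))
        log_losses.append(
            sum(w * cross_entropy(P[s], Q[s]) for s, w in enumerate(d_human))
        )
        d_model = step(d_model, Q)
        d_human = step(d_human, P)
    return entropies, log_losses


entropies, log_losses = per_step_curves(30)
for t in (0, 5, 29):
    print(f"step {t:2d}: entropy={entropies[t]:.3f}  log loss={log_losses[t]:.3f}")
```

In this toy chain the log loss stays flat because text from `P` never visits the junk tokens, while the model's own samples gradually drift into them, so the entropy per step rises with generation length. This is only meant to illustrate the kind of gap the abstract calls error accumulation.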

Paper Structure

This paper contains 22 sections, 6 theorems, 59 equations, 8 figures, 3 algorithms.

Key Result

Proposition 3.1

For $v$ infinite and $m$ large, the per-step probability of generating a singleton, in expectation over draws of the training set, is given by $\mathbb{E}[K_{m,1}]/m \approx C_\alpha \, m^{1/\alpha - 1}$, where $C_\alpha$ is a constant depending only on $\alpha$, and $K_{m,1}$ is a random variable denoting the number of items seen exactly once in a set of $m$ samples.
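The dependence on $\alpha$ can be checked numerically. The sketch below uses a standard identity for i.i.d. sampling, $\mathbb{E}[K_{m,1}]/m = \sum_i p_i (1 - p_i)^{m-1}$, to compute the expected singleton mass exactly for a power law $p_i \propto i^{-\alpha}$. It then measures the log-log slope between two sample sizes and prints it next to $1/\alpha - 1$, the asymptotic slope quoted in the Figure 1 caption. The finite vocabulary size `V`, the two values of $m$, and the function names are assumptions chosen for illustration.

```python
import math


def power_law_probs(v, alpha):
    """Normalized power-law probabilities p_i proportional to i^(-alpha), i = 1..v."""
    weights = [rank ** -alpha for rank in range(1, v + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_singleton_mass(probs, m):
    """Exact E[K_{m,1}] / m for m i.i.d. draws: sum_i p_i * (1 - p_i)^(m - 1)."""
    return sum(p * (1.0 - p) ** (m - 1) for p in probs)


def loglog_slope(probs, m_lo, m_hi):
    """Slope of log(singleton mass) versus log(m) between two sample sizes."""
    lo = expected_singleton_mass(probs, m_lo)
    hi = expected_singleton_mass(probs, m_hi)
    return (math.log(hi) - math.log(lo)) / (math.log(m_hi) - math.log(m_lo))


V = 200_000  # assumed finite vocabulary, standing in for "v infinite"
slopes = {}
for alpha in (1.1, 1.5):
    probs = power_law_probs(V, alpha)
    slopes[alpha] = loglog_slope(probs, 1_000, 10_000)
    print(
        f"alpha={alpha}: measured slope={slopes[alpha]:.3f}, "
        f"asymptotic 1/alpha - 1={1 / alpha - 1:.3f}"
    )
```

Because the vocabulary here is finite, the measured slopes can be somewhat steeper than the asymptotic value, particularly for $\alpha$ near 1. The qualitative point is that the heavier-tailed distribution (smaller $\alpha$) loses singleton mass more slowly as $m$ grows.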

Figures (8)

  • Figure 1: Left: the expected total mass of tokens seen exactly once, given m samples from a power law distribution over a vocabulary of size $v$, for three settings of the power law exponent $\alpha = 1.0, 1.25, 1.5$. Their relationship is roughly log-log linear up to $m \approx v/3$, with slope slightly steeper than the asymptotic expression of $1/\alpha - 1$. Right: log frequency versus log rank of the top 5000 unigrams in three datasets. The power law exponent $\alpha$, given by the slope of each curve, is close to $1$ for WikiText and WritingPrompts, while it is $1.5$ for CodeContests, suggesting that text has heavier tails than code. Together, these plots suggest that the singleton mass should decay more slowly with $m$ for WikiText and WritingPrompts than for CodeContests.
  • Figure 2: Log calibration error versus log model size for four model families and three datasets. We find that the scaling laws fit relatively well, suggesting that the relationship between calibration and scale is predictable. Furthermore, while there is variation between model families, the scaling exponents for each dataset are somewhat close to those predicted by theory (WikiText: $0.089$, WritingPrompts: $-0.10$, CodeContests: $-0.33$), with heavier-tailed datasets having slower scaling.
  • Figure 3: Entropy for each generation step (solid) and log loss for each token in the ground truth (dashed), for each dataset (columns) and each model family (rows), with models colored by size. Models have entropy much higher than their log loss, with the gap growing with the number of generation steps, a sign of error accumulation. For the text datasets, models of different sizes seem to be similarly miscalibrated, while for code the degree of miscalibration seems to improve with size.
  • Figure 4: Entropy calibration error versus log loss for base Qwen2.5 (1.5B, 7B, 72B) compared to the instruction-tuned versions, along with various temperature settings (see the appendix for all model sizes). Positive calibration error means that the model's entropy is higher than its log loss, while negative means that its entropy is lower than its log loss. We find that each modification of the base model reduces entropy while increasing log loss, calibrating at the cost of diversity.
  • Figure 5: MAUVE for excerpts of model generations plotted against the entropy (in nats) of the excerpt, with models colored by size (see the appendix for the full plots containing all model families). These plots show that sample quality drops when entropy is too high or too low.
  • ...and 3 more figures
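The scaling exponents discussed in the Figure 2 caption are slopes of a line fitted in log-log space. As a rough sketch of that kind of fit, the code below runs ordinary least squares on log error versus log model size. The data are synthetic placeholders generated from a known power law (`TRUE_A`, `TRUE_B`), not measurements from the paper, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                       # scaling exponent (slope in log-log space)
    a = math.exp(y_mean - b * x_mean)   # prefactor (intercept in log-log space)
    return a, b


# Synthetic placeholder data, NOT results from the paper: error = a * size^b.
TRUE_A, TRUE_B = 5.0, -0.1
sizes = [0.5e9, 1.5e9, 7e9, 70e9]  # hypothetical parameter counts
errors = [TRUE_A * n ** TRUE_B for n in sizes]

a, b = fit_power_law(sizes, errors)
print(f"fitted prefactor a={a:.3f}, fitted exponent b={b:.3f}")
```

An exponent near 0 means scale helps little: with $b = -0.1$, a model 10 times larger has its error multiplied by $10^{-0.1} \approx 0.79$, a reduction of only about 21%.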

Theorems & Definitions (10)

  • Proposition 3.1: informal
  • Theorem 5.2
  • Theorem A.1
  • Lemma A.2
  • Lemma A.3
  • Lemma A.4
  • Proof of Theorem (theorem:appendix)
  • Proof of Lemma (lemma:compute-gradient)
  • Proof of Lemma (lemma:logloss)
  • Proof of Lemma (lemma:fitting)