On the Entropy Calibration of Language Models
Steven Cao, Gregory Valiant, Percy Liang
TL;DR
The paper analyzes entropy calibration in autoregressive language models, showing that miscalibration (per-step entropy growing as generation proceeds) improves only slowly with model and data size, especially for text with heavy tails. A simplified power-law analysis links the exponent of the data distribution to this slow scaling, and empirical results on models up to 70B parameters corroborate slow improvement for text, while code shows faster gains. Standard mitigations such as temperature adjustment and instruction tuning reduce entropy but typically worsen log loss, implying a tradeoff between diversity and stability. The authors prove that it is theoretically possible to calibrate entropy without increasing log loss if a predictor of future entropy can be learned, offering a pathway to generations that are both stable and diverse. Together, these results motivate developing practical future-entropy-based calibration methods to achieve high-quality, diverse, and stable generations at scale.
Abstract
We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing (and text quality decreasing) as generations grow longer. This error accumulation is a fundamental problem in autoregressive models, and the standard solution is to truncate the distribution, which improves text quality at the cost of diversity. In this paper, we ask: is miscalibration likely to improve with scale, and is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the scaling behavior depends on the power law exponent of the data distribution: in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted by the simplified setting: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
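To make the quantities in the abstract concrete, here is a minimal single-step sketch in Python. It is a toy illustration under stated assumptions, not the paper's sequence-level setup or method: the "human" distribution is assumed to be a Zipf power law with exponent 1.1 over 1000 items, the hypothetical model is that distribution mixed with 10% uniform noise, and truncation is modeled as top-k with k = 50. All function names and constants are illustrative. The sketch compares the model's entropy with its log loss, then shows that truncating lowers entropy while making log loss infinite whenever the data has mass outside the kept set.

```python
import math

def zipf(n, alpha):
    """Power-law distribution p(i) proportional to i^(-alpha), i = 1..n."""
    weights = [i ** -alpha for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(q):
    """Shannon entropy of q in nats."""
    return -sum(x * math.log(x) for x in q if x > 0)

def log_loss(p, q):
    """Cross-entropy of model q on data from p (nats).

    Returns infinity if q assigns zero probability to an outcome p can produce.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total -= pi * math.log(qi)
    return total

def top_k(q, k):
    """Keep the k most probable outcomes of q and renormalize."""
    order = sorted(range(len(q)), key=lambda i: q[i], reverse=True)
    keep = set(order[:k])
    kept = [x if i in keep else 0.0 for i, x in enumerate(q)]
    total = sum(kept)
    return [x / total for x in kept]

# Assumed toy setup: heavy-tailed "human" data and a model that
# spreads 10% of its mass uniformly (so it over-weights the tail).
n = 1000
p = zipf(n, alpha=1.1)
q = [0.9 * pi + 0.1 / n for pi in p]

# Calibration gap: model entropy minus log loss on the data.
gap = entropy(q) - log_loss(p, q)

# Truncation lowers entropy but breaks log loss on the tail.
q_trunc = top_k(q, k=50)

print("entropy(q)        =", entropy(q))
print("log_loss(p, q)    =", log_loss(p, q))
print("calibration gap   =", gap)
print("entropy(top-k q)  =", entropy(q_trunc))
print("log_loss(p, top-k)=", log_loss(p, q_trunc))
```

In this toy the gap is positive, meaning the model is more entropic than its log loss would suggest, and top-k truncation reduces entropy only by assigning zero probability to outcomes the data can still produce. The single-step picture leaves out the accumulation of error over long generations that the abstract describes.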
