Compute-Optimal LLMs Provably Generalize Better With Scale

Marc Finzi, Sanyam Kapoor, Diego Granziol, Anming Gu, Christopher De Sa, J. Zico Kolter, Andrew Gordon Wilson

TL;DR

This paper develops a tokenwise generalization framework for compute-optimal LLMs by deriving a fully empirical Freedman-type martingale bound that accounts for loss variance and quantization effects. The bound decomposes the generalization gap into the ratio of parameters to data, the per-token loss variance, and the quantization gap, and demonstrates that as models scale along the Chinchilla frontier the variance and quantization terms shrink while the parameter-to-data ratio remains fixed. It introduces smoothing to control worst-case losses and extends the bound to discrete hypothesis classes, then validates the theory with Pythia models and 4-bit quantization, showing decreasing loss variation with size and modest quantization gaps. A second thread analyzes compressibility via Hessian-based QuIP arguments and sublinear information transfer via prequential coding, arguing that larger models are more quantizable and store information more efficiently, which further tightens generalization bounds at scale. Collectively, the results offer a principled, scale-aware explanation for why compute-optimal LLMs generalize better and provide practical, data-driven bounds that improve with model size.

Abstract

Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
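The claim that the number of parameters per token stays constant along the compute-optimal frontier can be illustrated with a small sketch. It assumes the common estimate of roughly 6 FLOPs per parameter per token and a ratio of about 20 tokens per parameter; both constants are frequently quoted in connection with the Chinchilla scaling laws but are not stated in this text, and the function name `compute_optimal` is hypothetical.

```python
import math

# Assumption: commonly quoted Chinchilla-style ratio, not taken from this paper.
TOKENS_PER_PARAM = 20.0
# Assumption: standard training-cost estimate, FLOPs ~ 6 * N * D.
FLOPS_PER_PARAM_TOKEN = 6.0


def compute_optimal(flops):
    """Return (N, D) satisfying D = 20 * N and 6 * N * D = flops."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for flops in (1e20, 1e22, 1e24):
        n, d = compute_optimal(flops)
        # N and D both grow with the budget, but N / D does not change.
        print(f"C={flops:.0e}  N={n:.3e}  D={d:.3e}  N/D={n / d:.3f}")
```

Under these assumptions both the model size and the dataset size grow as the square root of the compute budget, so their ratio is fixed. Any improvement in the bound with scale therefore has to come from the other two terms, the loss variance and the quantization error.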

Paper Structure

This paper contains 29 sections, 14 theorems, 74 equations, 5 figures.

Key Result

Theorem 3.1

Let $(X_k)_{k=1}^n = X_1, \dots , X_n$ and $(Y_k)_{k=1}^n$ be sequences of random variables adapted to the filtration $(\mathcal{F}_{k})_{k=0}^n$, where $X_k$ is $\mathcal{F}_{k}$-measurable and $Y_k$ is $\mathcal{F}_{k-1}$-measurable. Assume the difference between the two is bounded below: $A_k = (Y_{\dots}$ [...] where $C := \frac{1}{n}\log\frac{|K|}{\delta}$ and $v(x) = x-\log(1+x)$.
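The two quantities defined at the end of the theorem can be evaluated directly. The sketch below is illustrative only: it treats $K$ as a finite set so that $|K|$ is its size (an assumption read off the notation), and the numeric values for $|K|$, $\delta$, and $n$ are hypothetical. It does not reproduce the bound itself, which is truncated above.

```python
import math


def complexity_term(k_size, delta, n):
    """C := (1/n) * log(|K| / delta), with k_size standing in for |K|."""
    return math.log(k_size / delta) / n


def v(x):
    """v(x) = x - log(1 + x), defined for x > -1."""
    return x - math.log1p(x)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the orders of magnitude involved.
    print("C =", complexity_term(k_size=100, delta=0.05, n=10**9))
    for x in (1e-3, 1e-2, 1e-1, 1.0):
        # For small x, v(x) is close to x**2 / 2 (second-order Taylor term).
        print(f"x={x:g}  v(x)={v(x):.6g}  x^2/2={x * x / 2:.6g}")
```

Since $\log(1+x) = x - x^2/2 + O(x^3)$, the function $v$ behaves like $x^2/2$ near zero, which is the usual way a variance-type term enters Freedman-style inequalities. The term $C$ shrinks as $1/n$ and depends only logarithmically on $|K|$ and $1/\delta$.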

Figures (5)

  • Figure 1: Pythia models and checkpoints chosen along the compute-optimal frontier (checkpoints given by the marked values).
  • Figure 2: Left: A direct comparison of our evaluated generalization bound and the empirical loss as a function of model scale. As the model is scaled up, our bound improves just like the empirical loss. Center: Loss variation $\Sigma$ entering into the generalization bound. As the loss variation decreases, so does the largest term in our bound. Right: Comparison of the relative scale of the contributions to \ref{eq:full_theorem}. Here we use a fixed 4-bit quantization of the parameters.
  • Figure 3: Left: Information content contained in the model, as upper bounded by $K(h)$ from the information transfer prequential coding approach, vs. parameter counting and quantization. Fitting a power law to the prequential $K(h)$ yields $6\times 10^5\cdot N^{0.5\pm0.1}$. While parameter counting gives a better upper bound over the range of Pythia models, the sublinear scaling of the prequential bound means that it eventually overtakes parameter counting, somewhere around 30B parameters. Center: The contributions of the various terms to our generalization bounds when using prequential coding complexity, along with their power law fits. Right: Comparison of generalization bounds produced by the prequential vs. quantization-based approaches. While the prequential bounds are worse, they follow a power law and improve substantially with scale.
  • Figure 4: Spectral density plots of the 70M-parameter Pythia model trained on varying fractions of the Pile dataset, using the same data and random vector seed.
  • Figure 5: Comparison of spectral density and $\mathop{\mathrm{Tr}}\nolimits(\sqrt{{\bm{H}}})$ estimations for different subsample sizes and configurations.
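The crossover mentioned in the Figure 3 caption can be sanity-checked with a rough calculation. The sketch below assumes that $K(h)$ is measured in bits and that parameter counting costs exactly 4 bits per parameter (matching the 4-bit quantization in Figure 2); neither assumption is confirmed by the captions, and the function names are hypothetical.

```python
# Assumption: parameter counting costs 4 bits per parameter.
BITS_PER_PARAM = 4.0
# Prefactor of the power-law fit quoted in the Figure 3 caption.
PREFACTOR = 6e5


def prequential_bits(n_params, exponent=0.5):
    """Power-law fit from the caption: K(h) ~ 6e5 * N**exponent."""
    return PREFACTOR * n_params ** exponent


def crossover_params(exponent=0.5):
    """Solve BITS_PER_PARAM * N = PREFACTOR * N**exponent for N."""
    return (PREFACTOR / BITS_PER_PARAM) ** (1.0 / (1.0 - exponent))


if __name__ == "__main__":
    # Scan the quoted uncertainty range of the exponent, 0.5 +/- 0.1.
    for exponent in (0.4, 0.5, 0.6):
        n_star = crossover_params(exponent)
        print(f"exponent={exponent}: crossover near N = {n_star:.3g}")
```

With the central exponent of 0.5 this gives a crossover near $2.25\times 10^{10}$ parameters, the same order of magnitude as the roughly 30B quoted in the caption. The estimate is very sensitive to the exponent, so the $\pm 0.1$ uncertainty in the fit moves the crossover by orders of magnitude.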

Theorems & Definitions (23)

  • Theorem 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Theorem 3.4
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof
  • ...and 13 more