Compute-Optimal LLMs Provably Generalize Better With Scale

Marc Finzi; Sanyam Kapoor; Diego Granziol; Anming Gu; Christopher De Sa; J. Zico Kolter; Andrew Gordon Wilson

TL;DR

This paper develops a tokenwise generalization framework for compute-optimal LLMs by deriving a fully empirical Freedman-type martingale bound that accounts for loss variance and quantization effects. The bound decomposes the generalization gap into the ratio of parameters to data, the per-token loss variance, and the quantization gap, and demonstrates that as models scale along the Chinchilla frontier the variance and quantization terms shrink while the parameter-to-data ratio remains fixed. It introduces smoothing to control worst-case losses and extends the bound to discrete hypothesis classes, then validates the theory with Pythia models and 4-bit quantization, showing decreasing loss variation with size and modest quantization gaps. A second thread analyzes compressibility via Hessian-based QuIP arguments and sublinear information transfer via prequential coding, arguing that larger models are more quantizable and store information more efficiently, which further tightens generalization bounds at scale. Collectively, the results offer a principled, scale-aware explanation for why compute-optimal LLMs generalize better and provide practical, data-driven bounds that improve with model size.

Abstract

Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. This generalization bound can be decomposed into three interpretable components: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As compute-optimal language models are scaled up, the number of parameters per data point remains constant; however, both the loss variance and the quantization error decrease, implying that larger models should have smaller generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows more slowly than their capacity on the compute-optimal frontier. From these findings we produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
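
The claim that the number of parameters per data point stays constant on the compute-optimal frontier can be illustrated with a little arithmetic. The sketch below is not taken from the paper. It assumes two commonly quoted approximations that the abstract does not state: training compute of roughly C = 6 * N * D FLOPs, and a Chinchilla-style ratio of about 20 training tokens per parameter. The function name `chinchilla_allocation` and the parameter `tokens_per_param` are hypothetical, chosen only for this illustration.

```python
import math


def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters N, tokens D).

    Assumptions (not from the paper): C is about 6 * N * D, and the
    compute-optimal ratio D / N is a fixed constant (about 20).
    """
    # Substitute D = r * N into C = 6 * N * D and solve for N.
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for c in (1e20, 1e22, 1e24):
        n, d = chinchilla_allocation(c)
        # N and D both grow with compute, but N / D does not change.
        print(f"C={c:.0e}  N={n:.3e}  D={d:.3e}  N/D={n / d:.3f}")
```

Under these assumptions both N and D grow as the square root of compute, so the parameters-per-token term in the bound is unchanged by scale. Any improvement in the bound therefore has to come from the other two components, the loss variance and the quantization error, which is the argument the abstract makes.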
