Non-Vacuous Generalization Bounds for Large Language Models

Sanae Lotfi, Marc Finzi, Yilun Kuang, Tim G. J. Rudner, Micah Goldblum, Andrew Gordon Wilson

TL;DR

This work delivers the first non-vacuous generalization bounds for pretrained large language models by coupling PAC-Bayes compression bounds with a novel nonlinear SubLoRA parameterization and prediction smoothing to handle the unbounded negative log-likelihood objective. It introduces practical tools—including document-level independence assumptions and subsampling-based bound computation—to scale bound evaluation to datasets with billions of tokens, and demonstrates non-vacuous bounds for GPT-2 architectures up to nearly a billion parameters. Empirically, larger models exhibit tighter bounds and greater compressibility, suggesting genuine generalization beyond memorization, and the framework reveals how text structure influences generalization. The approach provides a quantitative, compressibility-based lens on LLM generalization with a scalable pipeline applicable to future, larger models and datasets, offering a principled benchmark for understanding and certifying LLM generalization performance.

Abstract

Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply parrot their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, accelerating bound computation by orders of magnitude on massive datasets. To achieve the extreme level of compression required for non-vacuous bounds, we devise SubLoRA, a simple low-dimensional nonlinear parameterization that leads to non-vacuous generalization bounds for models with nearly a billion parameters. Finally, we use our bounds to understand LLM generalization and find that larger models have better generalization bounds and are more compressible than smaller models.

Paper Structure

This paper contains 27 sections, 2 theorems, 20 equations, 4 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1.1

Consider a bounded risk $R(h,x_i) \in [a,a+\Delta]$ and a finite hypothesis space $h\in \mathcal{H}$ for which we have a prior $P(h)$ that does not depend on $\{x_i\}$. Let the empirical risk $\hat{R}(h) = \frac{1}{m}\sum_{i=1}^m R(h,x_i)$ be a sum over independent random variables $R(h,x_i)$ for a
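
The statement above is cut off in this copy, so its conclusion is not reproduced here. For orientation only, the textbook bound for this kind of setup (bounded risk of range $\Delta$, finite hypothesis space, data-independent prior $P(h)$, independent terms) follows from Hoeffding's inequality plus a prior-weighted union bound: with probability at least $1-\delta$, $R(h) \le \hat{R}(h) + \Delta\sqrt{\frac{\log 1/P(h) + \log 1/\delta}{2m}}$. The sketch below evaluates that generic form; it is an assumption about the shape of the result, not a transcription of Theorem 1.1, and the function names and the code-length prior $P(h) = 2^{-\text{bits}}$ are hypothetical choices made for illustration.

```python
import math


def prior_cost_from_bits(code_length_bits: float) -> float:
    """Return log(1/P(h)) in nats for an assumed prior P(h) = 2^(-code_length_bits).

    Assumption: the hypothesis is described by a prefix-free code, so that
    these prior values sum to at most one.
    """
    return code_length_bits * math.log(2.0)


def finite_hypothesis_bound(
    emp_risk: float,
    risk_range: float,
    log_inv_prior_nats: float,
    m: int,
    delta: float = 0.05,
) -> float:
    """Generic Hoeffding + union-bound upper bound on the expected risk.

    emp_risk            empirical risk, an average of m independent terms
    risk_range          Delta, the width of the interval the risk lives in
    log_inv_prior_nats  log(1/P(h)) for the selected hypothesis
    m                   number of independent samples
    delta               allowed failure probability
    """
    if m <= 0:
        raise ValueError("m must be positive")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must be in (0, 1]")
    complexity = (log_inv_prior_nats + math.log(1.0 / delta)) / (2.0 * m)
    return emp_risk + risk_range * math.sqrt(complexity)


# Illustrative numbers only: a hypothesis encoded in 1000 bits, 10,000 samples.
example = finite_hypothesis_bound(
    emp_risk=0.5,
    risk_range=1.0,
    log_inv_prior_nats=prior_cost_from_bits(1000),
    m=10_000,
)
```

In this form the bound is non-vacuous only when the complexity term is small relative to $\Delta$, which is why a short description of the model (a large $P(h)$) matters as much as a low empirical risk.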

Figures (4)

  • Figure 1: Finding solutions that simultaneously achieve low training error and low complexity with SubLoRA. (Left): The Pareto frontier of model complexity (the second term in Eq. \ref{eq:bound}) and the empirical risk (bits per dimension (BPD) and Top-1 Error) of language models using LoRA and subspace compression for next token prediction pretraining. The generalization bound is formed from the sum of the two axes (lower is better), with the shaded region showing where bounds are vacuous. Combining both LoRA and subspace compression in the form of SubLoRA yields the best bounds, while using LoRA alone yields vacuous bounds for top-1 error. (Right): SubLoRA enables a smooth tradeoff over the extent of model compression for a fixed model, making it possible to find the degree of compression that yields the best generalization bound. We plot the contributions of the empirical risk and the complexity term to the bound as a function of this degree of compression.
  • Figure 2: Varying Parameters of the Compression Bounds. (Left): A plot of the generalization bound as a function of the projection dimension $d$ with LoRA. The subspace dimension gives us a way to explicitly trade off the degree of compression with the empirical risk, and we optimize $d$ to produce the best bounds. (Right): A plot of the worst-case range of BPD values $\Delta$, the empirical risk, and the resulting generalization bounds as a function of the prediction smoothing parameter $\alpha$. For each model, a different $\alpha$ can be chosen after the models have already been trained.
  • Figure 3: Larger models achieve stronger generalization bounds. As we scale up the size of the model via the model parameters (holding the training set fixed), we find that our generalization bounds get better rather than worse. Dots show models trained with differing degrees of compression, indicated by their color. On the right we show the number of bits required to express the training dataset using the model and including the model weights in the compression. Classification error bounds consistently favor smaller models, while data compression favors much larger models, and BPD bounds are in between.
  • Figure 4: Breaking text structure with permutations. We compute bounds for LLMs that were trained with the order of the tokens shuffled within each sequence.
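
The dependence of the worst-case range $\Delta$ on the smoothing parameter $\alpha$ described for Figure 2 can be illustrated with a small sketch. It assumes the smoothing is a mixture of the model's next-token distribution with a uniform distribution over a vocabulary of size $V$, $p_{\text{smooth}} = (1-\alpha)\,p + \alpha/V$, and that the loss is measured in bits per token; this page does not spell out either detail, so both are assumptions, and the function names are hypothetical. Under that assumption every probability lies in $[\alpha/V,\ (1-\alpha) + \alpha/V]$, so the loss is bounded and $\Delta = \log_2\!\big(1 + (1-\alpha)V/\alpha\big)$.

```python
import math


def smooth_probs(probs, alpha):
    """Mix a next-token distribution with the uniform distribution.

    Assumed form: p_smooth = (1 - alpha) * p + alpha / V, with V = len(probs).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    vocab_size = len(probs)
    return [(1.0 - alpha) * p + alpha / vocab_size for p in probs]


def loss_range_bits(vocab_size, alpha):
    """Return (lowest, highest, width) of the per-token loss in bits.

    The lowest loss occurs when the model puts probability 1 on the true
    token, the highest when it puts probability 0 on it.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    lowest = -math.log2((1.0 - alpha) + alpha / vocab_size)
    highest = -math.log2(alpha / vocab_size)
    return lowest, highest, highest - lowest


# A toy 4-token distribution with a zero entry: the unsmoothed loss on the
# last token would be infinite, the smoothed loss is finite.
smoothed = smooth_probs([0.7, 0.2, 0.1, 0.0], alpha=0.1)
loss_on_last_token = -math.log2(smoothed[-1])

# Range of the loss for a GPT-2 sized vocabulary (50257 tokens).
_, _, width_small_alpha = loss_range_bits(50257, alpha=0.05)
_, _, width_large_alpha = loss_range_bits(50257, alpha=0.5)
```

A larger $\alpha$ shrinks $\Delta$ and therefore the complexity term, but it also pulls predictions toward uniform and raises the empirical risk, which is the tradeoff the right panel of Figure 2 plots.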

Theorems & Definitions (4)

  • Theorem 1.1
  • proof
  • Theorem 1.2
  • proof