Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models

Sanae Lotfi; Yilun Kuang; Brandon Amos; Micah Goldblum; Marc Finzi; Andrew Gordon Wilson

TL;DR

This paper introduces token-level generalization bounds for large language models by leveraging martingale properties to exploit the abundance of training tokens, enabling non-vacuous guarantees for models up to 70B parameters with post-training quantization. It develops a novel non-IID token-level bound, couples it with practical compression methods (Monarch, Kronecker, LoRA) and QuIP 2-bit quantization, and demonstrates strong empirical alignment with downstream performance on pretrained GPT2 and LLaMA variants as well as antibody-design tasks. The token-level approach reduces the impact of the complexity term by increasing the number of samples, while prediction smoothing further tightens the bounds. Altogether, the results show that practically deployed LLMs can enjoy meaningful generalization guarantees that reflect their actual performance, with significant implications for model design and deployment in real-world settings.

Abstract

Large language models (LLMs) with billions of parameters excel at predicting the next token in a sequence. Recent work computes non-vacuous compression-based generalization bounds for LLMs, but these bounds are vacuous for large models at the billion-parameter scale. Moreover, these bounds are obtained through restrictive compression techniques, bounding compressed models that generate low-quality text. Additionally, the tightness of these existing bounds depends on the number of IID documents in a training set rather than the much larger number of non-IID constituent tokens, leaving untapped potential for tighter bounds. In this work, we instead use properties of martingales to derive generalization bounds that benefit from the vast number of tokens in LLM training sets. Since a dataset contains far more tokens than documents, our generalization bounds not only tolerate but actually benefit from far less restrictive compression schemes. With Monarch matrices, Kronecker factorizations, and post-training quantization, we achieve non-vacuous generalization bounds for LLMs as large as LLaMA2-70B. Unlike previous approaches, our work achieves the first non-vacuous bounds for models that are deployed in practice and generate high-quality text.

Paper Structure (39 sections, 2 theorems, 11 equations, 4 figures, 9 tables)

Introduction
Related Work
Background
Token-Level Generalization Bounds
A Novel Non-IID Token-Level Generalization Bound
Sampling and Empirical Risk Evaluation
Token-level Bounds Are Predictive of Generalization
Token-Level Prediction Smoothing
Compressing LLMs to Minimize Complexity
Efficient Nonlinear Parametrizations
QuIP 2-Bit Quantization of LLM
Non-Vacuous Bounds for LLMs with Billions of Parameters
Token-level Bounds via Nonlinear Parametrizations
Non-vacuous Bounds for Pretrained LLMs: GPT2, LLaMA1 and LLaMA2
Token-Level Generalization Bounds on Antibody Sequences
...and 24 more sections

Key Result

Theorem 1

With probability at least $1-\delta$ over the randomness in a sampled sequence $\{x_1,x_2,\dots, x_m\}$, if the negative log likelihood of a model $h\in \mathcal{H}$ can be bounded $- \log_2 p_h( \cdot |x_{<i}) \in [a,a+\Delta_i]$, then the negative log likelihood of the data for model $h$ satisfies where $\hat{\Delta} = \sqrt{\frac{1}{m}\sum_{i=1}^m \Delta_i^2}$, the expectation is taken over $X_
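As a rough illustration of how a bound of this kind trades empirical loss against model complexity, the sketch below assumes the usual Azuma-Hoeffding form with a prefix-free-code prior: empirical BPD plus $\hat{\Delta}\sqrt{(\mathcal{L}(h)\log 2 + \log(1/\delta))/(2m)}$, where $\mathcal{L}(h)$ is the compressed model size in bits. That form, the symbol $\mathcal{L}(h)$, the function names, and the uniform-mixture smoothing used to obtain a finite range $\Delta_i$ are assumptions made for illustration, not the paper's exact statement or code.

```python
import math
from typing import Sequence


def smoothing_range_bits(alpha: float, vocab_size: int) -> float:
    """Width in bits of the per-token NLL interval after uniform smoothing.

    Assumes the smoothed prediction (1 - alpha) * p + alpha / vocab_size,
    so every token probability lies in [alpha / V, 1 - alpha + alpha / V].
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return math.log2(1.0 + (1.0 - alpha) * vocab_size / alpha)


def rms_range(ranges_bits: Sequence[float]) -> float:
    """Root mean square of the per-token ranges (the Delta-hat of Theorem 1)."""
    if not ranges_bits:
        raise ValueError("need at least one range")
    return math.sqrt(sum(r * r for r in ranges_bits) / len(ranges_bits))


def token_level_bound(
    empirical_bpd: float,
    delta_hat: float,
    num_samples: int,
    complexity_bits: float,
    delta: float = 0.05,
) -> float:
    """Assumed bound: empirical BPD plus a complexity penalty.

    complexity_bits is the compressed model size in bits (hypothetical
    input); it is converted to nats before being added to log(1 / delta).
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    log_term = complexity_bits * math.log(2) + math.log(1.0 / delta)
    return empirical_bpd + delta_hat * math.sqrt(log_term / (2 * num_samples))
```

With the compressed size held fixed, counting tokens rather than documents as samples increases `num_samples` by orders of magnitude, so the penalty term shrinks roughly as $1/\sqrt{m}$. This is the mechanism that lets less aggressive compression schemes still yield non-vacuous bounds.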

Figures (4)

Figure 1: Non-vacuous bounds for LLMs that scale up to 70B parameters. Left: Bits per dimension (BPD) bounds on the Amber dataset [liu2023llm360], which contains $1.2$ trillion tokens, for different LLMs from the LLaMA family ranging in scale from 7 billion to 70 billion parameters [touvron2023LLaMA]. All of these models are quantized to $2$, $3$, and $4$ bits per weight using QuIP# and are publicly available [tseng2024quip]. The different quantization precisions are accounted for in the compressed model size. The trade-off between empirical performance and model complexity in our bounds generally favors models with a smaller compressed size, though across different architectures we can find larger models that yield better bounds. Middle: The BPD training loss for different models from the LLaMA family; the legend is shared with the left panel. Overall, larger models yield a lower BPD while having a higher compressed size. Right: Validation negative log-likelihood loss as a function of the total number of trainable parameters for different nonlinear parametrizations, namely low-rank adaptation (LoRA), the Kronecker decomposition of dense matrices, and Monarch matrices. The x-axis is in log scale. The optimal compression technique changes as we vary the number of trainable parameters.
Figure 2: Our bounds analyze a quantity that is meaningful and predictive of generalization. Left: Using LLaMA2-7B, we compute the entropy of $p(x_i|x_{<i})$, where the context $x_{<i}$ is fixed and sampled from the Amber training dataset. The distribution over next tokens given a fixed context from the training data is indeed diffuse and characterized by high entropy values. Middle: Entropy of $p(x_i|x_{<i})$ as a function of the token index $i$, shown on the x-axis, for a context length $L=1024$. The average entropy has a decreasing trend but remains high overall; note that the average entropy for $i=768$ is as high as the average entropy for $i=128$. Right: On the left $y$-axis, we plot the zero-shot accuracy (ACC) and perplexity (PPL) achieved by GPT2 models ranging in scale from 117M to 1.5B parameters, averaged over downstream datasets, as reported in [radford2019language]. On the right $y$-axis, we plot an approximation of the conditional BPD expectation that we bound in Equation \ref{eq:main_bound}, where we resample $x_i$ from LLaMA2-7B given fixed training contexts $x_{<i}$ from the Amber dataset. The approximation of the BPD objective that we bound achieves $97.9\%$ and $99.1\%$ correlation with the accuracy and perplexity, respectively.
Figure 3: Token-level prediction smoothing improves our bounds. Left: After training, we optimize a conservative upper bound on the generalization bound that we would get from Equation \ref{eq:main_bound} with respect to the $\alpha$ head parameters. Doing so yields a noticeable reduction in the value of the bound. Middle: BPD generalization bound as a function of a single global parameter chosen from a discrete set of values vs. the generalization bound for the token-dependent $\alpha$ after optimization. Right: Histogram of the values taken by $\alpha(x_{<i})$ over different inputs.
Figure 4: As language models are compressed, they retain their understanding of patterns, but they rapidly forget highly random and unstructured data. Experiments are performed on GPT2 models with datasets created as detailed in Section \ref{sec:memorization}. Compression is performed via post-training quantization, where lower quantization levels reflect more aggressive compression.

Theorems & Definitions (3)

Theorem 1
Theorem 2
proof
