Learning is Forgetting: LLM Training As Lossy Compression

Henry C. Conklin, Tom Hosking, Tan Yi-Chern, Julian Gold, Jonathan D. Cohen, Thomas L. Griffiths, Max Bartolo, Seraphina Goldfarb-Tarrant

Abstract

Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn, or to relate their learning to learning in humans. We argue LLMs are best seen as an instance of lossy compression: over training, they learn by retaining only the information in their training data that is relevant to their objective(s). We show that pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open-weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model's compression, and the information present in it, predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. More generally, the work presented here offers a unified information-theoretic framing for how these models learn that is deployable at scale.
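
The abstract's two axes of compression, which Figure 1 labels complexity (mutual information between representations and the input) and expressivity (mutual information with the predicted output), can be illustrated with a small plug-in estimate. The sketch below is a minimal illustration only: it hard-quantises hidden states with k-means rather than using the paper's soft-entropy estimator, and the function name, cluster count, and array shapes are assumptions.

```python
# Illustrative only: one information-plane point (complexity, expressivity)
# for a single checkpoint, using hard k-means quantisation and plug-in
# mutual information instead of the paper's soft-entropy estimator.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def information_plane_point(hidden_states, input_ids, target_ids,
                            n_clusters=256, seed=0):
    """hidden_states: (n_tokens, d_model) representations Z
    input_ids:     (n_tokens,) input token ids X
    target_ids:    (n_tokens,) next-token ids Y
    Returns (complexity, expressivity) in nats."""
    z_hat = KMeans(n_clusters=n_clusters, n_init="auto",
                   random_state=seed).fit_predict(hidden_states)
    complexity = mutual_info_score(input_ids, z_hat)      # I(X; Z_hat)
    expressivity = mutual_info_score(z_hat, target_ids)   # I(Z_hat; Y)
    return complexity, expressivity
```

Tracking this pair of values across checkpoints traces the kind of trajectory shown on the information plane in Figure 1.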


Paper Structure

This paper contains 48 sections, 16 equations, 18 figures.

Figures (18)

  • Figure 1: LLMs Learn an Optimal Compression of the Internet. (Left) The information plane for pre-training of the OLMo2 7B model. The horizontal axis shows mutual information between representations and the input (complexity); the vertical axis shows mutual information with the predicted output (expressivity). The dotted line indicates the bound at which models are optimally compressed; hue indicates the point in training, in billions of tokens. Estimates are based on 10,000 samples from the C4 dataset, a broad crawl of the internet. (Right) The vertical axis shows the OLMo2 7B model's loss on next-token prediction of C4; the horizontal axis shows the model's proximity to the bound. Representations begin to approach the bound as the loss saturates.
  • Figure 2: Illustration of Soft Entropy Estimation. (Top) These facets illustrate the normalisation, sampling, and soft assignment formalised in equation \ref{eq:quantisation}. (Bottom) Soft assignments are aggregated into a distribution $P(\hat{Z})$ describing the space, of which we take the Shannon entropy (equation \ref{eq:aggregate}). An interactive visual of this process is available at https://henryconkl.in/posts/so-u-want/. A hedged illustrative sketch of this estimate appears after this figure list.
  • Figure 3: Illustration of conditional probability estimates. An example sentence is provided, assuming word-level tokenization for simplicity. At left are the indices for the input and output tokens when the current input word is wherefore. At right is the sub-setting procedure for estimating conditional probabilities. This illustrates that bigram estimates do not compute the entropy of two token embeddings, but rather the entropy of the current token's embedding conditioned on the preceding context. A sketch of this sub-setting likewise follows the figure list.
  • Figure 4: Models Largely Encode Local Context. (Top) The information plane over pre-training for the different levels of back-off. By changing how many tokens of the context window we condition the mutual information on, we see how the OLMo2 7B model compresses not just token-level but also local context information. Across all context windows we see the same two-phase pattern predicted by the Information Bottleneck, with more contextual representations approaching greater optimality (indicated by hue). As context increases, models compress both the target and the source over training, rather than compressing the target independently. We hypothesise this is because, in language modelling with full context, the target and source distributions are nearly identical. (Bottom) By computing the conditional mutual information for a level of back-off given the others, we can quantify what proportion of a model's information encodes each level of context. Each facet shows a different model size, with the horizontal axis reflecting training step and the vertical axis reflecting the proportion of information from the source; hue indicates the level of back-off.
  • Figure 5: Models Converge Along the Bound, With Smaller Models Struggling to Compress. (Top Left) Open-weights models across 6 families lie along the bound on optimal compression at the end of training. Hue indicates performance on MMLU Pro. (Top Right) The vertical axis indicates mutual information with preference, with models holding more preference information exhibiting better performance. (Bottom) Zooming in on later pre-training for each model size, the 1B model matches Phase 1 but struggles to achieve meaningful compression later on, oscillating off the frontier for much of pre-training. All results use back-off to the trigram level. A full legend identifying each dot, with additional levels of back-off, is given in Appendix Figures \ref{fig:full_token_plane} and \ref{fig:full_bigram_plane}.
  • ...and 13 more figures
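
As a companion to the Figure 2 description, the following is a hedged sketch of a soft entropy estimate: representations are normalised, anchor points are sampled, each point is softly assigned to the anchors, and the aggregated assignment distribution $P(\hat{Z})$ is scored with Shannon entropy. The anchor count, temperature, and distance kernel here are assumptions for illustration, not the paper's equations \ref{eq:quantisation} and \ref{eq:aggregate}.

```python
# Illustrative soft-entropy sketch: normalise, sample anchors, soft-assign,
# aggregate into P(Z_hat), and take the Shannon entropy (in nats).
import numpy as np

def soft_entropy(z, n_anchors=1024, temperature=1.0, seed=0):
    """z: (n, d) array of representations; returns a scalar entropy estimate."""
    rng = np.random.default_rng(seed)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)                 # normalisation
    n_anchors = min(n_anchors, len(z))
    anchors = z[rng.choice(len(z), size=n_anchors, replace=False)]   # sampling
    sq_dists = ((z[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    logits = -sq_dists / temperature                                 # soft assignment
    logits -= logits.max(axis=1, keepdims=True)                      # numerical stability
    soft_assign = np.exp(logits)
    soft_assign /= soft_assign.sum(axis=1, keepdims=True)
    p_z_hat = soft_assign.mean(axis=0)                               # aggregate P(Z_hat)
    return float(-(p_z_hat * np.log(p_z_hat + 1e-12)).sum())
```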
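
Figures 3 and 4 turn this into conditional quantities by sub-setting representations on the preceding context (the back-off level). The sketch below is likewise illustrative, reusing the soft_entropy sketch above; the grouping key, minimum-count filter, and names are assumptions rather than the paper's implementation. Mutual information with a given context level can then be formed via the standard identity $I(Z; C) = H(Z) - H(Z \mid C)$.

```python
# Illustrative back-off conditioning: group representations by preceding
# context and average the per-group soft entropies, weighted by frequency.
from collections import defaultdict
import numpy as np

def conditional_soft_entropy(z, context_ids, min_count=2):
    """z: (n, d) representations; context_ids: (n,) preceding-context ids
    (e.g. the previous token for bigram back-off). Returns H(Z | context)."""
    groups = defaultdict(list)
    for i, c in enumerate(context_ids):
        groups[int(c)].append(i)
    weighted, total = 0.0, 0
    for idx in groups.values():
        if len(idx) < min_count:           # skip contexts too rare to estimate
            continue
        weighted += len(idx) * soft_entropy(z[np.asarray(idx)])
        total += len(idx)
    return weighted / max(total, 1)
```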