Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws

Zhixuan Pan, Shaowen Wang, Jian Li

TL;DR

This work addresses the lack of principled theory for LLMs by reframing training as a data-compression problem grounded in Kolmogorov structure functions. It introduces a hierarchical Syntax-Knowledge model, combining a parametric syntax component with a nonparametric Pitman–Yor knowledge component, analyzed under a Bayesian coding framework. The authors derive data- and model-scaling laws, show how learning progresses from pervasive syntactic regularities to rarer factual knowledge, and provide explanations for hallucinations and fine-tuning dynamics, with empirical validation on synthetic and real data. The framework offers a data-centric lens that unifies prediction, compression, and scaling phenomena, with practical implications for data distribution design, knowledge injection, and instruction tuning.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors observed in LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, and factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.
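
The nonparametric knowledge component is described as a Pitman–Yor process, motivated by Heaps' and Zipf's laws. As a rough illustration of why such a process fits those laws, the sketch below simulates the standard Chinese-restaurant construction of a Pitman–Yor process, in which the number of distinct clusters keeps growing with the number of draws. The function name, the parameter values (`discount=0.5`, `concentration=1.0`), and the reading of "clusters" as knowledge elements are assumptions made for this example, not the paper's experimental setup.

```python
import random


def pitman_yor_crp(n, discount=0.5, concentration=1.0, seed=0):
    """Simulate n draws from a Pitman-Yor Chinese restaurant process.

    Assumes 0 <= discount < 1 and concentration > 0 (illustrative values).
    Returns (assignments, counts), where counts[k] is the size of cluster k.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of draws assigned to cluster k
    assignments = []   # cluster index chosen at each step
    for i in range(n):
        num_clusters = len(counts)
        # Probability of opening a new cluster after i previous draws.
        p_new = (concentration + discount * num_clusters) / (i + concentration)
        if rng.random() < p_new:
            counts.append(1)
            assignments.append(num_clusters)
        else:
            # An existing cluster k is chosen with weight counts[k] - discount.
            weights = [c - discount for c in counts]
            k = rng.choices(range(num_clusters), weights=weights)[0]
            counts[k] += 1
            assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        _, counts = pitman_yor_crp(n)
        print(n, "draws ->", len(counts), "distinct clusters")
```

With a positive discount, the distinct-cluster count grows sublinearly in the number of draws (the Heaps-like behavior) and the cluster sizes are heavy-tailed (the Zipf-like behavior), so ever rarer elements keep appearing as the data grows.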

Paper Structure

This paper contains 42 sections, 14 theorems, 122 equations, 11 figures, 4 tables.

Key Result

Theorem 5.2

Under the Bayesian sequential prediction framework and the mutual-information assumption, the averaged optimal Bayesian redundancy (per sentence) of the hierarchical data model $\phi_{\text{data}}$ satisfies a bound stated in terms of $d_{\text{knw}}$ and $d_{\text{syn}}$, the parameter dimensions of the knowledge and syntax clusters, respectively, and $n_s$, the number of distinct syntax clusters.

Figures (11)

  • Figure 1: Decomposition of validation loss on knowledge tokens by frequency class, as model size increases. (a) Empirical results on a power-law-distributed dataset: knowledge tokens are grouped into four frequency classes (from most to least frequent) and colored accordingly. Smaller models capture only the most frequent knowledge (the loss of the most frequent class decreases first), while larger models gradually acquire less frequent knowledge. Each vertical dashed line marks the model size beyond which further loss reduction for a given frequency class becomes negligible, indicating the irreducible part of the loss. (b) Theoretical prediction of the same loss decomposition (the optimal solution of the constrained optimization problem \ref{eq:constrainedoptimization} in Section \ref{sec:model_scaling_law}), including the irreducible part of the loss (i.e., the $H(P_\phi)$ term in \ref{eq:reddef1}), which reproduces this frequency-dependent acquisition order and plateauing behavior.
  • Figure 2: (a) Accuracy of sufficiently trained models with different sizes across varying input frequencies. When the frequency falls below a model-specific threshold, small models inevitably hallucinate and fail to learn the corresponding facts. (b) Accuracy of different frequency classes (split into four quantiles) under varying model sizes. As model size increases, the model progressively learns the more frequent data first, while infrequent data becomes learnable only at larger scales.
  • Figure 3: (a) Kolmogorov Structure Function View of LLMs: The $\alpha$-axis represents model size, while the y-axis represents the loss of the model, corresponding to code length of data given the model. The anti-diagonal solid straight line is the sufficiency line ($x+y=K(X)$), which is the lower bound of the code length of all possible two-part codes. The upper solid curve represents $h_X(\alpha)$. The dashed blue curve is the test loss of some LLM. (b) Illustration of the hierarchical Syntax-Knowledge model.
  • Figure 4: Kolmogorov Structure Function View of LLMs: The $\alpha$-axis represents model size, while the y-axis represents the loss of the model, corresponding to code length of data given the model. The anti-diagonal solid straight line is the sufficiency line ($x+y=K(X)$), which is the lower bound of the code length of all possible two-part codes. The upper solid curve represents $h_X(\alpha)$. The dashed blue curve is the test loss of some LLM. For more details about the test loss curve, see panel (b) of Figure \ref{fig:kol_3d}.
  • Figure 5: (a) Validation loss as a function of training data size. Models trained on data sampled from pretrained knowledge under various power-law distributions (i.e., $p(i) \sim (i + b)^{\text{power}}$) show clear power-law scaling of loss with data size, while uniform sampling does not. A more skewed data distribution leads to faster loss decay. (b) Loss decomposition by data frequency class: high-frequency data is learned earlier, while lower-frequency data is acquired later during training.
  • ...and 6 more figures
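
The Figure 5 caption describes sampling training data from a shifted power law over item ranks. A minimal sketch of such a sampler is given below, assuming items are indexed by rank $i = 1, \dots, n$ and that the exponent is negative so that lower ranks are more frequent. The function names and the values of `b` and `power` are hypothetical and are not taken from the paper's experiments.

```python
import random


def shifted_power_law_probs(num_items, b=1.0, power=-1.5):
    """Normalized probabilities p(i) proportional to (i + b) ** power, i = 1..num_items.

    b and power are illustrative defaults, not values from the paper.
    """
    weights = [(i + b) ** power for i in range(1, num_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_items(num_items, num_samples, b=1.0, power=-1.5, seed=0):
    """Draw item ranks (1-based) from the shifted power-law distribution."""
    rng = random.Random(seed)
    probs = shifted_power_law_probs(num_items, b, power)
    return rng.choices(range(1, num_items + 1), weights=probs, k=num_samples)


if __name__ == "__main__":
    probs = shifted_power_law_probs(1000)
    print("p(1) =", probs[0], " p(1000) =", probs[-1])
    print("first draws:", sample_items(1000, 10))
```

A more negative `power` concentrates probability mass on the lowest ranks, which corresponds to the "more skewed" distributions mentioned in the caption; setting `power=0` recovers uniform sampling.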

Theorems & Definitions (30)

  • Definition 3.1: Kolmogorov structure function
  • Theorem 5.2
  • Corollary 5.3
  • Theorem 5.6
  • Example A.1
  • Proposition A.2
  • Lemma A.3: clarke1994jeffreys, duchi2024lecture, jeon2024information
  • proof
  • Definition B.1: "Best Fit" function
  • Definition B.2: Kolmogorov structure function
  • ...and 20 more
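
For orientation, the Kolmogorov structure function named in Definition 3.1 and plotted as $h_X(\alpha)$ in the figure captions is classically defined over finite sets containing the data. The block below gives that textbook set-based form together with the two-part-code inequality behind the sufficiency line; the paper's own Definition 3.1 may be phrased differently (for instance over probabilistic models), so this should be read as a sketch of the standard notion only.

```latex
% Classical (set-based) Kolmogorov structure function of a string X:
% the smallest log-size of a finite set S that contains X and can be
% described in at most \alpha bits.
h_X(\alpha) \;=\; \min_{S} \bigl\{ \log_2 |S| \;:\; X \in S,\ K(S) \le \alpha \bigr\}

% Any such S yields a two-part code for X: describe S, then the index of X in S.
K(X) \;\le\; K(S) + \log_2 |S| + O(1)

% Hence the curve lies on or above the sufficiency line x + y = K(X),
% up to an additive constant:
\alpha + h_X(\alpha) \;\ge\; K(X) - O(1)
```

Here $K(\cdot)$ denotes Kolmogorov complexity, $\alpha$ plays the role of the model-size budget, and $h_X(\alpha)$ plays the role of the code length of the data given the model, matching the axes described in the figure captions.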