Unifying Two Types of Scaling Laws from the Perspective of Conditional Kolmogorov Complexity

Jun Wan

TL;DR

The paper explains two observed scaling laws for LLMs through lossless compression and conditional Kolmogorov complexity (KC). It argues that pre-training scales by adding parameters and inference scales by adding intermediate tokens; both increase the number of execution steps available, and both therefore yield better approximations of the KC quantities $C(x,y)$ and $C(x|y)$. A central claim is that pre-training effectively computes an upper bound on the joint KC and that decoder-only transformers can, in principle, approximate the conditional KC in the limit, which unifies the two scaling laws under a common KC-based framework. The work connects minimum description length principles, KC theory, and practical scaling laws, offering a principled way to analyze the resource requirements and limits of large-scale language models.

Abstract

In 2020, OpenAI proposed the first type of Scaling Laws, describing the relationships between model loss and the scale of parameters, data, and training computation. In 2024, OpenAI proposed the second type of Scaling Laws, describing the relationship between model inference performance and inference computation. In this paper, we analyze the LLM training and inference processes from the perspective of lossless compression using conditional Kolmogorov complexity, and unify these two types of Scaling Laws. We find that both types of Scaling Laws improve the approximation of conditional Kolmogorov complexity by increasing the number of execution steps of the Turing machine. The first type of Scaling Laws increases execution steps by increasing the number of model parameters. The second type of Scaling Laws increases execution steps by increasing the number of intermediate tokens.
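The compression view in the abstract can be illustrated with a minimal sketch. It assumes a toy adaptive, Laplace-smoothed character model standing in for an LLM predictor; the function names and the example string are hypothetical and are not taken from the paper. Under arithmetic coding, a predictor that assigns probability $p$ to each next symbol yields a code of about $\sum -\log_2 p$ bits (up to a small constant overhead), so a better predictor gives a shorter lossless code and hence a tighter upper bound on the description length of the string.

```python
import math
from collections import Counter


def adaptive_code_length_bits(text: str, alphabet: str) -> float:
    """Ideal arithmetic-coding length of `text`, in bits, under an adaptive
    Laplace-smoothed unigram model (a toy stand-in for an LLM predictor)."""
    counts = Counter({ch: 1 for ch in alphabet})  # one pseudo-count per symbol
    total = len(alphabet)
    bits = 0.0
    for ch in text:
        if counts[ch] == 0:
            raise ValueError(f"symbol {ch!r} is not in the alphabet")
        p = counts[ch] / total      # predicted probability of the next symbol
        bits += -math.log2(p)       # ideal code length for this symbol
        counts[ch] += 1             # update the model after coding the symbol
        total += 1
    return bits


def uniform_code_length_bits(text: str, alphabet: str) -> float:
    """Code length, in bits, when every symbol is assumed equally likely."""
    return len(text) * math.log2(len(alphabet))


if __name__ == "__main__":
    sample = "a" * 30 + "b" * 2     # hypothetical, highly regular string
    print(uniform_code_length_bits(sample, "ab"))
    print(adaptive_code_length_bits(sample, "ab"))
```

The adaptive model needs far fewer bits than the uniform baseline on this regular string. This is only an analogy for the paper's argument: the toy model is not an LLM, and its code length bounds the string's complexity from above without reaching it.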

Paper Structure (17 sections, 9 theorems, 15 equations, 1 figure)

Introduction
Related Work
Background
Turing Machines, Neural Networks & LLMs
Dynamic Arithmetic Coding
LLMs & Lossless Compression
Kolmogorov complexity
Two Types of Scaling Laws
Analysis of LLMs Pre-training and Inference from the Perspective of Conditional Kolmogorov Complexity
The Relationship Between LLMs Pre-training and Conditional Kolmogorov Complexity
The Relationship Between LLMs Inference and Conditional Kolmogorov Complexity
Conclusion
Limitations
Example of Arithmetic Coding
Proof of Theorem
...and 2 more sections

Key Result

Theorem 4.1

Given a universal Turing machine $U$, the joint Kolmogorov complexity $C(x, y)$ of strings $x$ and $y$ satisfies an inequality (not reproduced here) in which $l$ denotes string length and $O(1)$ is a constant that depends on the Turing machine $U$.

Figures (1)

Figure 1: The left figure achieves more efficient compression of the string $(k, r)$ by increasing the model parameters, while the right figure enhances the compression efficiency of the string $(k, r)$ by introducing more intermediate tokens.

Theorems & Definitions (11)

Theorem 4.1
Corollary 4.1
Theorem 4.2
Corollary 4.2
Corollary 4.3
Corollary 4.4
Corollary 4.5
Theorem 4.1
proof
Theorem 4.2
...and 1 more
