
Unifying Two Types of Scaling Laws from the Perspective of Conditional Kolmogorov Complexity

Jun Wan

TL;DR

The paper explains two observed scaling laws for LLMs through lossless compression and conditional Kolmogorov complexity (KC). It argues that pre-training scaling increases the number of execution steps by adding parameters, while inference scaling increases it by adding intermediate tokens, and that both yield better approximations of the KC quantities $C(x,y)$ and $C(x|y)$. A central claim is that pre-training effectively computes an upper bound on the joint KC and that decoder-only transformers can, in principle, approximate the conditional KC in the limit, which unifies the two scaling laws under a common KC-based framework. The work connects minimum description length principles, KC theory, and practical scaling laws, offering a principled lens for analyzing the resource implications and limits of large-scale language models.

Abstract

In 2020, OpenAI proposed the first type of Scaling Laws, describing the relationships between model loss and the scale of parameters, data, and training computation. In 2024, OpenAI proposed the second type of Scaling Laws, describing the relationship between model inference performance and inference computation. In this paper, we analyze the training and inference processes of LLMs from the perspective of lossless compression using conditional Kolmogorov complexity, and unify these two types of Scaling Laws. We find that both types of Scaling Laws improve the approximation of conditional Kolmogorov complexity by increasing the number of execution steps of a Turing machine. The first type of Scaling Laws increases execution steps by increasing the number of model parameters. The second type of Scaling Laws increases execution steps by increasing the number of intermediate tokens.
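
The compression view in the abstract can be made concrete with a small sketch. Kolmogorov complexity itself is uncomputable, so the code below uses stand-ins that are assumptions of this illustration and not the paper's construction: `zlib` output length as a crude computable upper-bound proxy for $C(x)$ (up to a constant for the decompressor), `zlib` with a preset dictionary as a proxy for $C(x|y)$, and a toy adaptive byte model whose summed $-\log_2 p$ is the length an ideal arithmetic coder would reach. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def zlib_bits(data: bytes) -> int:
    """Bits used by zlib for `data`: a rough upper-bound proxy for C(data)."""
    return 8 * len(zlib.compress(data, 9))


def zlib_conditional_bits(x: bytes, y: bytes) -> int:
    """Bits used for x when the decoder already holds y.

    y is supplied as a preset dictionary, so material in x that also
    appears in y is encoded as back-references. A rough proxy for C(x|y).
    """
    comp = zlib.compressobj(level=9, zdict=y)
    return 8 * len(comp.compress(x) + comp.flush())


def model_bits(data: bytes) -> float:
    """Ideal code length under an adaptive order-0 byte model.

    Each byte costs -log2 p(byte | counts so far), with add-one smoothing
    over the 256 possible byte values. An arithmetic coder driven by this
    model would approach this total; a language model plays the same role
    with a far better predictor.
    """
    counts = Counter()
    total = 0.0
    for i, b in enumerate(data):
        p = (counts[b] + 1) / (i + 256)  # i bytes seen so far
        total += -math.log2(p)
        counts[b] += 1
    return total


if __name__ == "__main__":
    y = (b"Scaling laws relate model loss to parameters, data, and compute; "
         b"inference-time scaling adds intermediate tokens before the answer.")
    x = y  # extreme case: x is fully determined by y
    print("proxy for C(x):  ", zlib_bits(x), "bits")
    print("proxy for C(x|y):", zlib_conditional_bits(x, y), "bits")
    print("model code length:", round(model_bits(x), 1), "bits")
```

A better predictor or a longer computation can only shorten such code lengths toward the true complexity, which is the sense in which the abstract speaks of improving the approximation.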

Paper Structure

This paper contains 17 sections, 9 theorems, 15 equations, and 1 figure.

Key Result

Theorem 4.1

Given a universal Turing machine $U$, the joint Kolmogorov complexity $C(x, y)$ of strings $x$ and $y$ satisfies the following inequality: where $l$ represents string length, and $O(1)$ is a constant related to Turing machine $U$.

Figures (1)

  • Figure 1: The left panel compresses the string $(k, r)$ more efficiently by increasing the number of model parameters, while the right panel improves the compression efficiency of the string $(k, r)$ by introducing more intermediate tokens.

Theorems & Definitions (11)

  • Theorem 4.1
  • Corollary 4.1
  • Theorem 4.2
  • Corollary 4.2
  • Corollary 4.3
  • Corollary 4.4
  • Corollary 4.5
  • Theorem 4.1
  • proof
  • Theorem 4.2
  • ...and 1 more